Durable Java

Mark Davis / http://www.macchiato.com/


Serialization


[Note to the editor: here is the contents of the left-hand first-page information. Both it and the column title have changed.]

Design your code from the start to be durable--so it can evolve without breaking your clients' code.

Dr. Mark Davis is lead architect at IBM's Center for Java Technology, Silicon Valley, co-founder and president of the Unicode Consortium, and architect for the bulk of JDK1.1 internationalization.

[Note to the editor: end of first-page left-hand information.]

In this column we take up serialization. This is the mechanism you use in Java when you want to store an object in a persistent form on disk, use RMI, or send an object over the net. It may also be used behind your back in other parts of the system--such as in JavaBeans--or when someone is serializing a class that has your class as a data member.

Supporting serialization in any operating system is not a trivial engineering task: there are a number of complications that make it difficult to balance robustness, speed, size, and usability. The JavaSoft engineers did a pretty good job of achieving a reasonable balance among these different factors, and have supplied excellent specifications and API documentation. (See the references for more information.)

For most classes, Java serialization is pretty easy to use, but there are some definite gotchas that stand in the way of making durable code. If you don't watch it, your serialization will easily break across different versions. You are also heavily dependent on the "kindness of strangers". If any of the classes you use in your objects do not serialize properly across their versions of software, then your class won't either. This includes the standard Java classes, plus any third-party classes that you use, plus your own classes.

Serialization demands not only backward compatibility, but forward compatibility. You might have people running slightly older versions of software needing to read the newer versions, and vice versa. Serialization incompatibilities are even more serious than API incompatibilities: we are talking about customers not being able to read data that they innocently stored away with a previous version of your software!

[Note to the typesetter: the contents of all the boxes in the article should not be modified. If it is too large for the space, don't line wrap or change the spacing--just use a slightly smaller font. Also, please place the boxes very near to where they are now in the text flow.]


Memory to Disk

"Look, what thy memory can not contain..." -- Sonnet LXXVII

In the normal case, it is pretty simple to make a class serializable. All you have to do is have it implement Serializable, and mark any fields that shouldn't persist with the keyword transient. Look at the example below, where the additions are the implements Serializable clause and the transient keyword.

Enabling Serialization

class MyClass implements Serializable {
    private String firstName;
    private String familyName;
    private float height;
    private Point location;
    transient private Hashtable cache;
    ...

Once you do that, anyone can store your object out on the disk or send it over the wire. You can also use serialization in memory to flatten your object into a simple array of bytes. The code to store Serializables out or read them back into memory is pretty simple. The following examples show writing an object myClass out to disk, and later reading it back in again.

Writing Objects

OutputStream os = new FileOutputStream(fileName);
ObjectOutput oo = new ObjectOutputStream(os);
oo.writeObject(myClass);
oo.close();

Reading Objects

InputStream is = new FileInputStream(fileName);
ObjectInput oi = new ObjectInputStream(is);
Object newObj = oi.readObject();
oi.close();

System.out.println(
  myClass.equals(newObj) ? "WIN" : "LOSE");
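
The same round trip can be done entirely in memory, flattening an object into an array of bytes as mentioned above. The following is only a minimal sketch written for a current JDK, not code from the original examples: the names RoundTripDemo, Person, toBytes, and fromBytes are invented for illustration, and Person is a trimmed-down stand-in for MyClass.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Hashtable;

class RoundTripDemo {
    // Hypothetical stand-in for MyClass.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        String firstName;
        String familyName;
        float height;
        // transient: never written to the stream.
        transient Hashtable<String, Object> cache = new Hashtable<>();

        Person(String firstName, String familyName, float height) {
            this.firstName = firstName;
            this.familyName = familyName;
            this.height = height;
        }
    }

    // Flatten any Serializable object into a byte array.
    static byte[] toBytes(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild an object from bytes produced by toBytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Person original = new Person("Mark", "Davis", 188.5f);
        Person copy = (Person) fromBytes(toBytes(original));
        System.out.println(copy.firstName + " " + copy.familyName);
        // The transient field is not restored, and field initializers
        // are not re-run on deserialization, so it comes back null.
        System.out.println("cache is null: " + (copy.cache == null));
    }
}
```

Note that the transient cache has to be rebuilt by hand (for example, lazily on first use) after the object is read back in.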

[Note to the editor: make this a sidebar]

Poor Man's Enum

Java doesn't provide language support for defining an enum: that is, a type-safe set of constants. Most often, people just use static final ints instead, and blow off type-safety. But in many places in Java 2, a poor-man's enum is used instead. This is an immutable class where no constructors are public, and the only access to objects of that class is through static finals. An example of this is Character.UnicodeBlock.

This provides type-safety for users of this enum. Since the class limits the construction of the objects, an equality check is also very fast; just a check for identity. Some disadvantages of this scheme are that you can't use poor-man's enums in switch statements, their serialized form is pretty bulky, and you have to add special code to preserve this identity relationship.

That is, you will need to keep a list (e.g. a hash table) of all the objects in the enum. When you deserialize an object, you check which of the listed enum values it corresponds to, then return the listed enum object instead of the deserialized object.
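
One way to sketch that special code is with the readResolve hook that Java 2 serialization provides. This is an illustration under assumptions rather than code from any shipping class: PoorMansEnumDemo, Suit, and copy are hypothetical names, and the registry is a plain HashMap keyed by name. (On Java 5 and later the built-in enum keyword does this for you; the sketch shows the manual pattern.)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class PoorMansEnumDemo {
    // Hypothetical poor-man's enum.
    static final class Suit implements Serializable {
        private static final long serialVersionUID = 1L;

        // Registry of the canonical instances. Declared before the
        // constants so it exists when their constructors run.
        private static final Map<String, Suit> VALUES = new HashMap<>();

        static final Suit HEARTS = new Suit("HEARTS");
        static final Suit SPADES = new Suit("SPADES");

        private final String name;

        private Suit(String name) {
            this.name = name;
            VALUES.put(name, this);
        }

        // Called by serialization after the object is read; the value
        // returned replaces the freshly deserialized copy.
        private Object readResolve() throws ObjectStreamException {
            Suit canonical = VALUES.get(name);
            if (canonical == null) {
                throw new InvalidObjectException("Unknown Suit: " + name);
            }
            return canonical;
        }
    }

    // Serialize and deserialize an object in memory.
    static Object copy(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object back = copy(Suit.HEARTS);
        System.out.println("identity preserved: " + (back == Suit.HEARTS));
    }
}
```

Without readResolve, the deserialized object would be a second, distinct Suit, and identity checks against Suit.HEARTS would fail.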

[Note to the editor: end of sidebar.]

That all seems rather simple, and it is fairly flexible. However, there are a couple of gotchas even if you are not worried about being durable across versions:

 

Evolving Classes

"Before the times of change, still is it so:
By a divine instinct men's minds mistrust
Ensuing dangers"--King Richard III

The real issues come in when you produce a new version of your software and you need to alter MyClass. The serialization spec introduces a useful term for such a new version of your class: they call it an evolved class. The first consideration when you evolve a class is to make sure that serialization will still consider it to be the same class as the old one. Even the simplest change to the class requires you to add a magic constant, serialVersionUID, tailored to your class. This requirement is in place to prevent accidentally evolved classes. The value of this constant is set by running the program serialver (distributed with your JDK) on your original class.

static final long serialVersionUID 
  = -6756364686697947626L;

Once you've done this, adding fields is pretty simple. Serialization will take care of the bookkeeping for you; the only thing you have to do is to make sure that your new fields have reasonable default values, since that's what will be used if they come in from old versions.

[Note to the editor: make this a sidebar]

64K Limit on Strings

The reason that Strings have a 64K limit is that a fixed two-byte length is written out before the data.

In a previous job, our system of serialization would write out length values with a variable number of bytes: one bit in each byte was reserved to mean "continue". With that kind of format, most strings would be a byte shorter, and longer strings would be possible. Unfortunately it is too late to change that in Java.
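
To make the idea concrete, here is a small sketch of one such variable-length encoding. It is not the format of that earlier system, whose details are not given here: the class name VarLengthDemo and the choice of seven data bits per byte, low-order bits first, with the high bit as the "continue" flag, are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;

class VarLengthDemo {
    // Encode a non-negative length, 7 bits per byte, low bits first.
    // The high bit of each byte means "another byte follows".
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative length");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (value >= 0x80) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // last byte: continue bit clear
        return out.toByteArray();
    }

    // Decode a value produced by encode.
    static int decode(byte[] data) {
        int result = 0;
        int shift = 0;
        for (byte b : data) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // continue bit clear: done
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] lengths = { 5, 127, 128, 100000 };
        for (int length : lengths) {
            byte[] encoded = encode(length);
            System.out.println(length + " takes " + encoded.length
                + " byte(s), decodes to " + decode(encoded));
        }
    }
}
```

With this scheme any length under 128 costs a single byte instead of two, and lengths beyond 65535 simply take more bytes.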

[Note to the editor: end of sidebar.]

Making other sorts of changes is not nearly so convenient. Once you mark your class with Serializable, you cast your basic object structure in concrete, and you can't change it easily in future versions. (And you can never remove Serializable once your class is marked by it.) Here are some of the things you can't do without some work:

For example, when it came time for us to upgrade NumberFormat to handle BigDecimal and BigInteger, we found a problem. It has fields that represent the minimum and maximum integer and fraction digits. These fields were defined as bytes--perfectly adequate for double, but just plain wrong for these new unlimited-precision numbers. So what do you do?

The most important step is to not make this mistake in the first place. Before you ever simply tag an object with Serializable, review the field structure carefully to make sure that you aren't doing something dumb.

But let's suppose that some other coders weren't as smart as you, and you now own their problem code. The simplest way to do it is just to add additional, redundant fields. For example, we could have added additional int fields like intMinDigits, and retained the old byte fields like minDigits. We need to add some magic methods for reading and writing, but that is pretty simple. Using the pattern below, we set the values of the old fields to some sort of reasonable values before they are stored out. When we read in objects, we test to see if the new field is valid; otherwise we set it to the value of the old field.

Duplicating Old Serialization Behavior

private void writeObject(ObjectOutputStream out)
  throws IOException {
    // Clamp to what the old byte field can hold.
    minDigits = (byte)
     ((intMinDigits > 127) ? 127 : intMinDigits);
    out.defaultWriteObject();
}

private void readObject(ObjectInputStream in) 
  throws IOException, ClassNotFoundException {
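    in.defaultReadObject();
    // A field missing from an old stream is left at its
    // default value (0 for an int), so test for that.
    if (intMinDigits == 0) {
      intMinDigits = minDigits;
    }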
}

If the cost of the redundancy in memory is too high, Java 2 does give you an out: you can override the serialization of the class by supplying some more magic code. In the following example, we had a byte field that needs to be changed to an int. To duplicate the same format as the old class so that the old version of the code can read it, we need to set up some special objects that tell serialization exactly what the old format was, so it can duplicate it. These basically define virtual fields that are serialized instead of your object's actual fields. In the example below, we are replacing a byte field called minDigits with an int field called intMinDigits.

Here is what is going on. In serialPersistentFields, we build a list of all the old fields, including their names and types. When writing, for the value of the old field (that doesn't exist in our evolved object), we find the largest value that we can fit in a byte, and put that in the set of virtual fields (along with all the new fields). After filling all the fields, we call writeFields to write out all the fields.

When reading it back in again, we call readFields to get the data into the list of virtual fields, then extract the data from there to put it into our real fields. If the new field does not exist in the list of virtual fields (that is, fields.defaulted("intMinDigits") returns true), then we use the old field value. With this mechanism, we can write objects that can be read by either old or evolved classes, and read objects serialized by either old or evolved classes.

Duplicating Old Serialization Behavior

private static final ObjectStreamField[] // magic
  serialPersistentFields = { 
    new ObjectStreamField("minDigits", Byte.TYPE), 
    ... define all other old fields ...
    new ObjectStreamField("intMinDigits", Integer.TYPE), 
    ... define all other additional fields ...
}; 

private void writeObject(ObjectOutputStream out)
  throws IOException {
    ObjectOutputStream.PutField fields 
      = out.putFields(); // magic
    // Clamp to what the old byte field can hold.
    byte minDigits = (byte)
     ((intMinDigits > 127) ? 127 : intMinDigits);
    fields.put("minDigits", minDigits);
    ... put other old fields ...
    fields.put("intMinDigits", intMinDigits);
    ... put other new fields ...
    out.writeFields();
}

private void readObject(ObjectInputStream in) 
  throws IOException, ClassNotFoundException {
    ObjectInputStream.GetField
      fields = in.readFields();     // magic
    // defaulted takes the field name as a string
    if (fields.defaulted("intMinDigits")) {
      intMinDigits = fields.get("minDigits", (byte)0);
    } else {
      intMinDigits = fields.get("intMinDigits", (int)0);
    }
    ... get other old fields ...
}

You can use this mechanism for other purposes as well, for example if your strings might be larger than 64K. The simplest thing to do is to define a virtual field that is an array of strings consisting of successive substrings of your original, each small enough to stay under the 64K limit, and use that instead of your actual string field.
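
The splitting and rejoining themselves might look like the sketch below. The class StringChunks and its methods are hypothetical helpers, not part of any library. The sketch also assumes that the two-byte length counts encoded bytes rather than characters, as it does for DataOutput.writeUTF, where a single char can take up to three bytes; that is why the chunk size here is 65535 / 3 characters rather than a full 64K.

```java
class StringChunks {
    // Largest chunk, in chars, that is guaranteed to encode to at most
    // 65535 bytes when every char needs three bytes (an assumption
    // based on the writeUTF encoding).
    static final int MAX_CHARS = 65535 / 3;

    // Break a string into successive pieces of at most MAX_CHARS chars.
    static String[] split(String s) {
        int count = (s.length() + MAX_CHARS - 1) / MAX_CHARS;
        String[] chunks = new String[count];
        for (int i = 0; i < count; i++) {
            int start = i * MAX_CHARS;
            int end = Math.min(start + MAX_CHARS, s.length());
            chunks[i] = s.substring(start, end);
        }
        return chunks;
    }

    // Reassemble the original string from its pieces.
    static String join(String[] chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            big.append((char) ('a' + i % 26));
        }
        String original = big.toString();
        String[] chunks = split(original);
        System.out.println("chunks: " + chunks.length);
        System.out.println("restored: " + join(chunks).equals(original));
    }
}
```

In writeObject you would put the result of split into the virtual field, and in readObject you would call join on what comes back.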

If you are still working on Java 1.1, PutField and GetField are not available to you; your best choice is probably to just keep the redundant fields. Otherwise, you just have to write out the information in the proper format by yourself. This is pretty tricky, since you don't have full control over what gets written inside of your readObject or writeObject.

Size Matters

"It must be an answer of most monstrous size
that must fit all demands."--All's Well That Ends Well

I first tried using serialization fairly early. I have some Java tools that I use for reading in particular versions of the Unicode character database, and providing an object interface to that data. The information is about 500K stored in a set of flat text files with very simple formats: just semicolon-delimited text representing data of various kinds associated with Unicode characters. At the time, I was just reading in the data, parsing the file, converting numeric text into numbers, etc, and storing the result in a set of objects in a hash table. This process takes a couple of seconds in a non-debug version, but about 15 seconds in the debug version.

I finally got tired of waiting for it to load each time, and decided to use the new serialization capabilities. I would just read the whole thing in once, then serialize it out to disk. The next time in, I could just deserialize the whole thing in one fell swoop. This way, I would avoid all the parsing that I was doing before. Making this change was simple, and I serialized the data out to disk. I was now prepared to say goodbye to waiting for parsing. I kicked it off, and waited...and waited...and waited... About two minutes later, it was done loading. I looked at the disk, and that 500K of data had grown into a serialized file of about 3 MB. What happened?

Let's take a simple case, an object of type MyClass with the above definition, containing data {"Mark", "Davis", 188.5, {1,1}}. With about 23 bytes of data, it turns into 200 bytes when serialized. As it turns out, the basic serialization mechanism stores all kinds of information in the file so that it can deserialize without any other assistance (see below).

Serialized Data

Length: 200
Magic: ACED
Version: 5
  OBJECT
    CLASSDESC
    Class Name: "MyClass"
    Class UID:  -C39EDC726B866EBL
    Class Desc Flags: SERIALIZABLE;
    Field Count: 4
    Field type: float
    Field name: "height"
    Field type: object
    Field name: "familyName"
    Class name: "Ljava/lang/String;"
    Field type: object
    Field name: "firstName"
    Class name: "Ljava/lang/String;"
    Field type: object
    Field name: "location"
    Class name: "Ljava/awt/Point;"
    Annotation: ENDBLOCKDATA
    Superclass description: NULL
   float: 188.5
   STRING: "Davis"
   STRING: "Mark"
   OBJECT
      CLASSDESC
      Class Name: "java.awt.Point"
      Class UID:  -493B758DCB8137DAL
      Class Desc Flags: SERIALIZABLE;
      Field Count: 2
      Field type: integer
      Field name: "x"
      Field type: integer
      Field name: "y"
      Annotation: ENDBLOCKDATA
      Superclass description: NULL
     integer: 1
     integer: 1

The situation is not as bad as it first seems; if you have multiple objects of the same type in a stream some of the overhead is shared. Still, it is unacceptable for many situations, both in terms of storage size and of speed. In a previous job, we had a similar system of serialization. In our case, we were more concerned about size and performance than anything else, and squeezed down hard on the number of bytes in the serialization. But this came at the expense of the programmers--to make classes serializable required the programmer to explicitly write out all their fields.
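
You can see the sharing for yourself by measuring the stream. The sketch below is illustrative only: OverheadDemo, Sample, and serializedSize are made-up names, and Sample is a cut-down version of MyClass without the Point field. It writes a given number of objects to one in-memory stream and reports the total size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class OverheadDemo {
    // Hypothetical cut-down version of MyClass.
    static class Sample implements Serializable {
        private static final long serialVersionUID = 1L;
        String firstName = "Mark";
        String familyName = "Davis";
        float height = 188.5f;
    }

    // Total bytes needed to write 'count' Sample objects to one stream.
    static int serializedSize(int count) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                for (int i = 0; i < count; i++) {
                    out.writeObject(new Sample());
                }
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int one = serializedSize(1);
        int hundred = serializedSize(100);
        System.out.println("1 object:    " + one + " bytes");
        System.out.println("100 objects: " + hundred + " bytes");
        // The class description is written only once per stream. In
        // this sketch the two string literals are also shared objects,
        // so later copies are written as short references.
        System.out.println("average per object: " + (hundred / 100.0));
    }
}
```

The first object pays for the class description; each later object of the same class carries only a reference to it plus its own data.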

Sun chose to make serialization simple, at the expense of size. Luckily, they provide an escape valve here as well, allowing you to take full responsibility for writing out the contents of your class and for managing changes over versions. You do this by making your class Externalizable instead of Serializable, and supplying two different magic routines. Here is the result of serializing the same class, modified to be externalizable. Notice that the actual data is not parseable externally any more--only your class knows the meaning of the data!

Externalized Data

Length: 54
Magic: ACED
Version: 5
  OBJECT
    CLASSDESC
    Class Name: "MyClass2"
    Class UID:  5CB3777417A3AB5BL
    Class Desc Flags: EXTERNALIZABLE;
    Field Count: 0
    Annotation
      ENDBLOCKDATA
    Superclass description
      NULL
  EXTERNALIZABLE: 
   [70 00 04 4D 61 72 6B 00 05 44 61 76 69 73 43 3C
    80 00 00 00 00 01 00 00 00 01]
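
The two magic routines are writeExternal and readExternal. Here is a minimal sketch of what such a class might look like; it is not the code that produced the dump above. ExternalizableDemo, Person, and copy are hypothetical names, the Point is flattened into two ints x and y to keep the example self-contained, and the order in which the fields are written is simply one reasonable choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class ExternalizableDemo {
    public static class Person implements Externalizable {
        String firstName;
        String familyName;
        float height;
        int x;
        int y;

        // Externalizable requires a public no-argument constructor.
        public Person() {
        }

        Person(String firstName, String familyName,
               float height, int x, int y) {
            this.firstName = firstName;
            this.familyName = familyName;
            this.height = height;
            this.x = x;
            this.y = y;
        }

        // You are responsible for every byte of the object's state.
        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(firstName);
            out.writeUTF(familyName);
            out.writeFloat(height);
            out.writeInt(x);
            out.writeInt(y);
        }

        // Must read exactly what writeExternal wrote, in the same order.
        @Override
        public void readExternal(ObjectInput in) throws IOException {
            firstName = in.readUTF();
            familyName = in.readUTF();
            height = in.readFloat();
            x = in.readInt();
            y = in.readInt();
        }
    }

    // Serialize and deserialize a Person in memory.
    static Person copy(Person p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Person) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Person back = copy(new Person("Mark", "Davis", 188.5f, 1, 1));
        System.out.println(back.firstName + " " + back.familyName
            + " " + back.height + " (" + back.x + "," + back.y + ")");
    }
}
```

Because nothing in the stream describes the fields, a durable version of this class would also need to write some kind of format version of its own so that readExternal can cope with older data.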

Here are some further things you can do to reduce memory size if you need to:

With this new-found capability also come some very significant constraints.

JVM Version Issues

"Begot of nothing but vain fantasy,
Which is as thin of substance as the air
And more inconstant than the wind,..."--Romeo and Juliet

Before getting into some of the version issues, we will take a quick look at the format of a stream. It is summarized as the BNF in the table below (for a full description of the format, see Sun's specification). Each of the uppercased words represents a special constant tag, usually a byte. In this description, the mark @ indicates that the current object (which may be a class description) is stored in a cache while serializing. Any new occurrence of that same object in the stream is not stored again; instead, it is represented by a reference (a tag plus an int). This has two benefits. First, it preserves the relationships where a single object is contained in multiple other objects. Second, it reduces the size of the stream significantly. The # mark indicates a reset, where this cache is flushed.

Serialized Data Format

stream = STREAM_MAGIC STREAM_VERSION item* 
item =
   OBJECT        classDesc @ classdata[]
 | CLASS         classDesc @
 | ARRAY         classDesc @ iCount values[iCount]
 | STRING        @ utf
 | CLASSDESC     classInfo
 | REFERENCE     intHandle
 | EXCEPTION     # object #
 | BLOCKDATA     bCount (byte)[bCount] 
 | BLOCKDATALONG iCount (byte)[iCount]
 | NULL
 | RESET

classdata =
 | item*                       // S&!W
 | item* item* ENDBLOCKDATA    // S&W
 | externalContent*            // E&!B (in P1)
 | item* item* ENDBLOCKDATA    // E&B  (in P2)

classDesc =
   CLASSDESC classInfo
 | NULL
 | REFERENCE intHandle
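
The effect of the cache (the @ mark) and of a reset can be observed from ordinary code. The following sketch uses invented names (SharedReferenceDemo, roundTrip) and writes the same object twice to one stream, optionally calling reset between the writes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Date;

class SharedReferenceDemo {
    // Write each item to one stream, then read them all back.
    // If resetBetween is true, the cache is flushed after each write.
    static Object[] roundTrip(Object[] items, boolean resetBetween) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                for (Object item : items) {
                    out.writeObject(item);
                    if (resetBetween) {
                        out.reset();
                    }
                }
            }
            Object[] result = new Object[items.length];
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                for (int i = 0; i < result.length; i++) {
                    result[i] = in.readObject();
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Date date = new Date(0);
        Object[] same = { date, date };

        // Second write becomes a REFERENCE: one object comes back.
        Object[] shared = roundTrip(same, false);
        System.out.println("same object without reset: "
            + (shared[0] == shared[1]));

        // After a reset the object is written in full again.
        Object[] separate = roundTrip(same, true);
        System.out.println("same object with reset: "
            + (separate[0] == separate[1]));
    }
}
```

One consequence worth remembering: if you modify an object and write it again to the same stream without a reset, the reader only sees a reference to the first, unmodified copy.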

The classInfo contains a host of information about the class: name, serialization UID, flags describing the serialization, its field names and types, optional annotations, and descriptions of its superclasses. Some of this information is omitted if the class is externalized; otherwise, it is used to determine the normal format of the classdata. The relation between the flags and the classdata format is indicated in the table by the flags' initial letters: SERIALIZABLE, WRITE_METHOD, EXTERNALIZABLE, and BLOCK_DATA. For example, E&B means that the EXTERNALIZABLE and BLOCK_DATA flags are both on.

The format for Externalizable has undergone a significant change, one that is marked in the data by having the BLOCK_DATA flag turned on. This flag is off by default in JDK 1.1, and on by default in JDK 1.2. JVMs from 1.1.7 onward can read both versions, but older JVMs will not understand the newer format. If you are using a newer version of the JVM, you must call the following method on your output stream when streaming, or else an Externalizable class will not be readable on older JVMs. (Unfortunately, it appears that you cannot set it on a per-class basis, so it is the user of the class that has to do this--the class itself has no control.)

// out is your ObjectOutputStream; call this
// before writing any objects
out.useProtocolVersion(
  ObjectStreamConstants.PROTOCOL_VERSION_1);
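
In context, the call is made on the stream instance before the first object is written. The sketch below is an illustration under assumptions: ProtocolVersionDemo, Payload, write, and read are hypothetical names, and Payload is a deliberately tiny Externalizable class. It writes the same object under both protocol versions and reads each one back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.UncheckedIOException;

class ProtocolVersionDemo {
    // Hypothetical minimal Externalizable class.
    public static class Payload implements Externalizable {
        int value;

        public Payload() {
        }

        Payload(int value) {
            this.value = value;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(value);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            value = in.readInt();
        }
    }

    // Serialize obj using the given stream protocol version.
    static byte[] write(Object obj, int protocolVersion) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                // Must come before the first writeObject call.
                out.useProtocolVersion(protocolVersion);
                out.writeObject(obj);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object read(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] v1 = write(new Payload(42),
            ObjectStreamConstants.PROTOCOL_VERSION_1);
        byte[] v2 = write(new Payload(42),
            ObjectStreamConstants.PROTOCOL_VERSION_2);
        System.out.println("version 1 stream: " + v1.length + " bytes");
        System.out.println("version 2 stream: " + v2.length + " bytes");
        System.out.println("read back: " + ((Payload) read(v1)).value);
    }
}
```

The reading side needs no corresponding call; the flags in the class description tell it which layout to expect.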

It is not uncommon to have problems with serialization across different versions of the JVM. If this happens to you, the first thing to check is that you properly defined a serialVersionUID. If you did, and you are still having problems, walk through the list of items above to make sure you didn't violate any of the constraints. Check that you are using the right protocol version when streaming. If you define magic methods, make sure that the contents correspond between the read and write versions, and that you are using them with the corresponding interface (Externalizable vs. Serializable). Make sure that you properly handle the fields in these magic methods so that old and evolved classes can coexist. Check that the method access is correct on the magic methods (some are required to be private, not public).

If none of that solves the problem, check Sun's specification, which goes into much more detail as to the API and serialization process. If you still have problems, they may be outside your code; some class that you are using may be incorrectly defined, so scout through those classes for problems. If you find such a problem class and you can't change its source, you'll have to use the magic methods to read and write its fields manually.


Wrap-up

Java serialization provides a powerful yet generally simple mechanism for object serialization. Blindly marking objects as Serializable, however, will get you into trouble in your next version, so prepare to spend a little up-front time to make your classes durable.

I had promised, in my last column, to take up the subject of changing method names. Serialization is a big topic, and even in the space of a whole column I can't do it justice, so method names are pushed off to next time. We will see then why doing something as simple as adding a new method--with a new name--can break your clients!


 

Resources

Sun provides excellent documentation and specifications on serialization, along with some well-annotated examples; see http://java.sun.com/products/jdk/1.2/docs/guide/serialization

For more about "poor man's enum", see White, Eric, "Enumerating Types with Class", Java Report 2(8), Sept. 1997, pp. 51-55.

Previous columns are archived at http://www.macchiato.com/columns/

Copyright (c) 1999, Mark Davis, All rights reserved.