Java RMI: Serialization
Versioning Classes
A few pages back, I described the serialization mechanism:
The serialization mechanism automatically, at
runtime, converts class objects into metadata so instances can be serialized
with the least amount of programmer work.
This is great as long as the classes don't change. When classes
change, the metadata, which was created from obsolete class objects,
accurately describes the serialized information. But it might not correspond
to the current class implementations.
The Two Types of Versioning Problems
There are two basic types of versioning problems that can occur.
The first occurs when a change is made to the class hierarchy (e.g., a
superclass is added or removed). Suppose, for example, a personnel application
made use of two serializable classes: Employee and Manager (a subclass of
Employee). For the next version of the application, two more classes need to
be added: Contractor and Consultant. After careful thought, the new hierarchy
is based on the abstract superclass Person, which has two direct subclasses:
Employee and Contractor. Consultant is defined as a subclass of Contractor,
and Manager is a subclass of Employee. See Figure 10-8.

Figure 10-8. Changing the class hierarchy.
While introducing Person is probably good object-oriented design, it breaks
serialization. Recall that serialization relied on the class hierarchy to
define the data format.
The second type of version problem arises from local changes to
a serializable class. Suppose, for example, that in our bank example, we want
to add the possibility of handling different currencies. To do so, we define a
new class, Currency, and change the definition of Money:
public class Money extends ValueObject {
    public float amount;
    public Currency typeOfMoney;
}
This completely changes the definition of Money but doesn't change the object
hierarchy at all.
The important distinction between the two types of versioning
problems is that the first type can't really be repaired. If you have old data
lying around that was serialized using an older class hierarchy, and you need
to use that data, your best option is probably something along the lines of
the following:
- Using the old class definitions, write an application
that deserializes the data into instances and writes the instance data out
in a neutral format, say as tab-delimited columns of text.
- Using the new class definitions, write a program that
reads in the neutral-format data, creates instances of the new classes, and
serializes these new instances.
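The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: OldEmployee, NewEmployee, and the HierarchyMigration helper are hypothetical names, both "versions" live in one file for brevity (in practice each step would run against its own set of class definitions, reading the old data with an ObjectInputStream), and field values are assumed to contain no tabs or newlines.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class HierarchyMigration {
    // Step 1 (run with the old classes): dump instances as tab-delimited text.
    // Assumes field values contain no tabs or newlines.
    static String exportRows(List<OldEmployee> oldInstances) {
        StringBuilder text = new StringBuilder();
        for (OldEmployee e : oldInstances) {
            text.append(e.name).append('\t').append(e.salary).append('\n');
        }
        return text.toString();
    }

    // Step 2 (run with the new classes): rebuild instances from the text.
    static List<NewEmployee> importRows(String text) {
        List<NewEmployee> result = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] columns = line.split("\t");
            NewEmployee e = new NewEmployee();
            e.name = columns[0];
            e.salary = Integer.parseInt(columns[1]);
            result.add(e);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<OldEmployee> old = List.of(new OldEmployee("Bob", 30000));
        List<NewEmployee> migrated = importRows(exportRows(old));
        // Serialize the rebuilt instances under the new hierarchy.
        try (ObjectOutputStream out =
                new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(new ArrayList<>(migrated));
        }
        System.out.println(migrated.size() + " instance(s) migrated");
    }
}

// Hypothetical stand-in for a class from the old hierarchy.
class OldEmployee implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int salary;

    OldEmployee(String name, int salary) {
        this.name = name;
        this.salary = salary;
    }
}

// Hypothetical stand-ins for the new hierarchy, rooted at Person.
abstract class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
}

class NewEmployee extends Person {
    private static final long serialVersionUID = 1L;
    int salary;
}
```

The neutral format is the only thing the two programs share, which is why a change to the hierarchy no longer matters once the data has been exported.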
The second type of versioning problem, on the other hand, can be
handled locally, within the class definition.
How Serialization Detects When a Class Has Changed
In order for serialization to gracefully detect when a
versioning problem has occurred, it needs to be able to detect when a class
has changed. As with all the other aspects of serialization, there is a
default way that serialization does this. And there is a way for you to
override the default.
The default involves a hashcode. Serialization creates a single
hashcode, of type long, from the following information:
- The class name and modifiers
- The names of any interfaces the class implements
- Descriptions of all methods and constructors except private methods and
constructors
- Descriptions of all fields except private static and private transient
fields
This single long, called the class's stream unique identifier (often
abbreviated suid), is used to detect when a class changes. It is an
extraordinarily sensitive index. For example, suppose we add the following
method to Money:
public boolean isBigBucks( ) {
    return _cents > 5000;
}
We haven't changed, added, or removed any fields; we've simply
added a method with no side effects at all. But adding this method changes the
suid. Prior to adding it, the suid was 6625436957363978372L; afterwards, it
was -3144267589449789474L. Moreover, if we had made isBigBucks( ) a protected
method, the suid would have been 4747443272709729176L.
TIP: These numbers can be computed using the serialver program that ships
with the JDK. For example, these were all computed by typing
serialver com.ora.rmibook.chapter10.Money at the command line for slightly
different versions of the Money class.
The default behavior for the serialization mechanism is a
classic "better safe than sorry" strategy. The serialization mechanism uses
the suid, which defaults to an extremely sensitive index, to tell when a class
has changed. If it has, the serialization mechanism refuses to create
instances of the new class using data that was serialized with the old
classes; deserialization fails with an InvalidClassException.
Implementing Your Own Versioning Scheme
While this is reasonable as a default strategy, it would be
painful if serialization didn't provide a way to override the default
behavior. Fortunately, it does. Serialization uses the default suid only if a
class definition doesn't provide one. That is, if a class definition includes
a static final long named serialVersionUID, then serialization will use that
static final long value as the suid. In the case of our Money example, if we
included the line:
private static final long serialVersionUID = 1;
in our source code, then the suid would be 1, no matter how many changes we
made to the rest of the class.
Explicitly declaring serialVersionUID allows us to change the class, and add
convenience methods such as isBigBucks( ), without losing backwards
compatibility.
TIP: serialVersionUID doesn't have to be private. However, it must be static,
final, and long.
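The difference between a computed suid and a declared one can also be inspected from code, using java.io.ObjectStreamClass. The following is a minimal sketch; the class names ComputedSuidMoney, DeclaredSuidMoney, and SuidLookup are hypothetical stand-ins, not the Money class from the bank example, so the computed value will not match the numbers quoted above.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SuidLookup {
    // Returns the suid that serialization would use for the given class:
    // the declared serialVersionUID if present, otherwise the computed hash.
    static long suidOf(Class<?> serializableClass) {
        return ObjectStreamClass.lookup(serializableClass).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("computed: " + suidOf(ComputedSuidMoney.class));
        System.out.println("declared: " + suidOf(DeclaredSuidMoney.class));
    }
}

// No serialVersionUID: the suid is hashed from the class description.
class ComputedSuidMoney implements Serializable {
    protected int _cents;
}

// Declared serialVersionUID: adding isBigBucks() does not change the suid.
class DeclaredSuidMoney implements Serializable {
    private static final long serialVersionUID = 1;
    protected int _cents;

    public boolean isBigBucks() {
        return _cents > 5000;
    }
}
```

Note that ObjectStreamClass.lookup returns null for a class that is not serializable, so the helper assumes it is only ever handed serializable classes.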
The downside to using serialVersionUID is that, if a significant change is
made (for example, if a field is added to the class definition), the suid
will not reflect this difference. This means that the deserialization code
might not detect an incompatible version of a class. Again, using Money as an
example, suppose we had:
public class Money extends ValueObject {
    private static final long serialVersionUID = 1;
    protected int _cents;
}
and we migrated to:
public class Money extends ValueObject {
    private static final long serialVersionUID = 1;
    public float amount;
    public Currency typeOfMoney;
}
The serialization mechanism won't detect that these are
completely incompatible classes. Instead, when it tries to create the new
instance, it will throw away all the data it reads in. Recall that, as part of
the metadata, the serialization algorithm records the name and type of each
field. Since it can't find the fields during deserialization, it simply
discards the information.
The solution to this problem is to implement your own versioning
inside of readObject( ) and writeObject( ). Your writeObject( ) method should
begin by writing out a version number:
private void writeObject(java.io.ObjectOutputStream out) throws IOException {
    out.writeInt(VERSION_NUMBER);
    ....
}
In addition, your readObject( ) code should start with a switch statement
based on the version number:
private void readObject(java.io.ObjectInputStream in) throws IOException,
        ClassNotFoundException {
    int version = in.readInt( );
    switch(version) {
        // version specific demarshalling code.
        ....
    }
}
Doing this will enable you to explicitly control the versioning
of your class. In addition to the added control you gain over the
serialization process, there is an important consequence you ought to consider
before doing this. As soon as you start to explicitly version your classes,
defaultWriteObject( ) and defaultReadObject( ) lose a lot of their usefulness.
Trying to control versioning puts you in the position of
explicitly writing all the marshalling and demarshalling code. This is a
trade-off you might not want to make.
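To make the pattern concrete, here is a minimal sketch of a class that writes a version number first and switches on it when reading. Everything specific in it is an assumption made for illustration: the class name VersionedMoney, the String currency code standing in for Currency, the two stream layouts, and the "USD" default for old data are all hypothetical. The fields are marked transient so that only the hand-written marshalling code determines what goes into the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class VersionedMoney implements Serializable {
    private static final long serialVersionUID = 1;
    private static final int VERSION_NUMBER = 2; // hypothetical current layout

    // transient: the stream contents are defined entirely by the code below.
    private transient float amount;
    private transient String currencyCode;

    VersionedMoney(float amount, String currencyCode) {
        this.amount = amount;
        this.currencyCode = currencyCode;
    }

    float amount() { return amount; }

    String currencyCode() { return currencyCode; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeInt(VERSION_NUMBER);
        out.writeFloat(amount);
        out.writeUTF(currencyCode);
    }

    private void readObject(ObjectInputStream in) throws IOException,
            ClassNotFoundException {
        int version = in.readInt();
        switch (version) {
            case 1: // assumed older layout: whole cents, no currency
                amount = in.readInt() / 100.0f;
                currencyCode = "USD"; // assumed default for old data
                break;
            case 2: // current layout
                amount = in.readFloat();
                currencyCode = in.readUTF();
                break;
            default:
                throw new InvalidObjectException("Unknown version: " + version);
        }
    }

    // Serializes and deserializes in memory; wraps checked exceptions.
    static VersionedMoney roundTrip(VersionedMoney original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (VersionedMoney) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        VersionedMoney copy = roundTrip(new VersionedMoney(12.5f, "EUR"));
        System.out.println(copy.amount() + " " + copy.currencyCode());
    }
}
```

The sketch also shows the trade-off described above: neither defaultWriteObject( ) nor defaultReadObject( ) appears anywhere, and every field has to be written and read by hand.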
Performance Issues
Serialization is a generic marshalling and demarshalling
algorithm, with many hooks for customization. As an experienced programmer,
you should be skeptical--generic algorithms with many hooks for customization
tend to be slow. Serialization is not an exception to this rule. It is, at
times, both slow and bandwidth-intensive. There are three main performance
problems with serialization: it depends on reflection, it has an incredibly
verbose data format, and it is very easy to send more data than is required.
Serialization Depends on Reflection
The dependence on reflection is the hardest of these to
eliminate. Both serializing and deserializing require the serialization
mechanism to discover information about the instance it is serializing. At a
minimum, the serialization algorithm needs to find out things such as the
value of serialVersionUID, whether writeObject( ) is implemented, and what the
superclass structure is. What's more, using the default serialization
mechanism (or calling defaultWriteObject( ) from within writeObject( )) will
use reflection to discover all the field values. This can be quite slow.
TIP: Setting serialVersionUID is a simple, and often surprisingly
noticeable, performance improvement. If you don't set serialVersionUID, the
serialization mechanism has to compute it. This involves going through all
the fields and methods and computing a hash. If you set serialVersionUID, on
the other hand, the serialization mechanism simply looks up a single value.
Serialization Has a Verbose Data Format
Serialization's data format has two problems. The first is all
the class description information included in the stream. To send a single
instance of Money, we need to send all of the following:
- The description of the ValueObject class
- The description of the Money class
- The instance data associated with the specific instance of Money.
This isn't a lot of information, but it's information that RMI
computes and sends with every method invocation. (Recall that RMI resets the
serialization mechanism with every method call.)
Even if the first two bullets comprise only 100 extra bytes of information,
the cumulative impact is probably significant.
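The overhead of the class descriptions can be measured directly. The sketch below is an illustration only: SketchValueObject and SketchMoney are hypothetical stand-ins for ValueObject and Money, and calling reset( ) after every write is used here merely to imitate a stream that forgets the class descriptions it has already sent, as described above for RMI. No byte counts are quoted because they depend on the class names and the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class StreamOverhead {
    // Returns the number of bytes needed to write `count` distinct instances.
    // With resetBetweenWrites, the class descriptions are resent every time.
    static int bytesFor(int count, boolean resetBetweenWrites) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                for (int i = 0; i < count; i++) {
                    out.writeObject(new SketchMoney(i));
                    if (resetBetweenWrites) {
                        out.reset();
                    }
                }
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("1 instance:              " + bytesFor(1, false));
        System.out.println("100 instances, no reset: " + bytesFor(100, false));
        System.out.println("100 instances, reset:    " + bytesFor(100, true));
    }
}

// Hypothetical stand-in for ValueObject.
class SketchValueObject implements Serializable {
    private static final long serialVersionUID = 1;
}

// Hypothetical stand-in for Money; carries only four bytes of real data.
class SketchMoney extends SketchValueObject {
    private static final long serialVersionUID = 1;
    protected int _cents;

    SketchMoney(int cents) {
        _cents = cents;
    }
}
```

Comparing the last two figures gives a rough idea of how much of the traffic is class description rather than instance data.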
The second problem is that each serialized instance is treated
as an individual unit. If we are sending large numbers of instances within a
single method invocation, then there is a fairly good chance that we could
compress the data by noticing commonalities across the instances being
sent.
It Is Easy to Send More Data Than Is Required
Serialization is a recursive algorithm. You pass in a single
object, and all the objects that can be reached from that object by following
instance variables are also serialized. To see why this can cause problems,
suppose we have a simple application that uses the Employee class:
public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
}
In a later version of the application, someone adds a new piece
of functionality. As part of doing so, they add a single additional field to
Employee:
public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
    public Employee manager;
}
What happens as a result of this? On the bright side, the
application still works. After everything is recompiled, the entire
application, including the remote method invocations, will still work. That's
the nice aspect of serialization--we added new fields, and the data format
used to send arguments over the wire automatically adapted to handle our
changes. We didn't have to do any work at all.
On the other hand, adding a new field redefined the data format
associated with Employee. Because serialVersionUID wasn't defined in the first
version of the class, none of the old data can be read back in anymore. And
there's an even more serious problem: we've just dramatically increased the
bandwidth required by remote method calls.
Suppose Bob works in the mailroom, and we serialize the object
associated with Bob. In the old version of our application, the data for
serialization consisted of:
- The class information for Employee
- The instance data for Bob
In the new version, we send:
- The class information for Employee
- The instance data for Bob
- The instance data for Sally (who runs the mailroom and is Bob's manager)
- The instance data for Henry (who is in charge of building facilities)
- The instance data for Alison (Director, Corporate Infrastructure)
- The instance data for Mary (VP in charge of IT)
And so on...
The new version of the application isn't backwards-compatible
because our old data can't be read by the new version of the application. In
addition, it's slower and is much more likely to cause network congestion.
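The growth in payload is easy to observe by serializing into a byte array and comparing sizes. The following sketch assumes a hypothetical ChainedEmployee class that mirrors the second version of Employee, plus a helper that builds a management chain of a given depth; the names and the depth are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class ReachabilityCost {
    // Serializes the object (and everything reachable from it) in memory
    // and returns the number of bytes produced.
    static int serializedSize(Object root) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(root);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds an employee with `levels` managers stacked above them.
    static ChainedEmployee withManagers(int levels) {
        ChainedEmployee top = null;
        for (int i = levels; i >= 0; i--) {
            top = new ChainedEmployee("Employee" + i, top);
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println("no manager:    " + serializedSize(withManagers(0)));
        System.out.println("four managers: " + serializedSize(withManagers(4)));
    }
}

// Hypothetical class mirroring the second version of Employee.
class ChainedEmployee implements Serializable {
    private static final long serialVersionUID = 1;
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
    public ChainedEmployee manager;

    ChainedEmployee(String firstName, ChainedEmployee manager) {
        this.firstName = firstName;
        this.manager = manager;
    }
}
```

Only the bottom employee is passed to writeObject( ), yet every manager above them ends up in the stream.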
The Externalizable Interface
To solve the performance problems associated with making a class
Serializable, the serialization mechanism allows you to declare that a class
is Externalizable instead. When ObjectOutputStream's writeObject( ) method is
called, it performs the following sequence of actions:
- It tests to see if the object is an instance of Externalizable. If so, it
uses externalization to marshall the object.
- If the object isn't an instance of Externalizable, it tests to see whether
the object is an instance of Serializable. If so, it uses serialization to
marshall the object.
- If neither of these two cases applies, an exception is thrown.
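The sequence of tests above can be mirrored in a few lines. This is a sketch of the decision only, not the actual ObjectOutputStream source; the class and method names are hypothetical. The choice of NotSerializableException matches what ObjectOutputStream throws for an object that implements neither interface.

```java
import java.io.Externalizable;
import java.io.NotSerializableException;
import java.io.Serializable;

class MarshallingChoice {
    // Mirrors the order of the tests described above.
    static String strategyFor(Object obj) throws NotSerializableException {
        if (obj instanceof Externalizable) {
            return "externalization";
        }
        if (obj instanceof Serializable) {
            return "serialization";
        }
        throw new NotSerializableException(obj.getClass().getName());
    }

    public static void main(String[] args) {
        try {
            System.out.println(strategyFor("a string"));
            System.out.println(strategyFor(new Object()));
        } catch (NotSerializableException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The order of the tests matters: Externalizable extends Serializable, so testing for Serializable first would never reach the externalization branch.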
Externalizable is an interface that consists of two methods:
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
public void writeExternal(ObjectOutput out) throws IOException;
These have roughly the same role that readObject( ) and writeObject( ) have
for serialization. There are, however, some very important differences. The
first, and most obvious, is that readExternal( ) and writeExternal( ) are part
of the Externalizable interface. An object cannot be declared to be
Externalizable without implementing these methods.
However, the major difference lies in how these methods are
used. The serialization mechanism always writes out class descriptions of all
the serializable superclasses. And it always writes out the information
associated with the instance when viewed as an instance of each individual
superclass.
Externalization gets rid of some of this. It writes out the
identity of the class (which boils down to the name of the class and the
appropriate serialVersionUID). It also stores the superclass structure and all
the information about the class hierarchy. But instead of visiting each
superclass and using that superclass to store some of the state information,
it simply calls writeExternal( ) on the local class definition. In a nutshell:
it stores all the metadata, but writes out only the local instance
information.
TIP: This is true even if the superclass implements Serializable. The
metadata about the class structure will be written to the stream, but the
serialization mechanism will not be invoked. This can be useful if, for some
reason, you want to avoid using serialization with the superclass. For
example, some of the Swing classes, while they claim to implement
Serializable, do so incorrectly (and will throw exceptions during the
serialization process). (JTextArea is one of the most egregious offenders.) If
you really need to use these classes, and you think serialization would be
useful, you may want to think about creating a subclass and declaring it to
be Externalizable. Instances of your class will be written out and read in
using externalization. Because the superclass is never serialized or
deserialized, the incorrect code is never invoked, and the exceptions are
never thrown.
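A small sketch may help show what "only the local instance information" means in practice. The classes ExternalMoney and TaggedValue are hypothetical. The subclass is Externalizable and writes just its own field, so the state held by its Serializable superclass never reaches the stream. One requirement not mentioned above, but imposed by the deserialization machinery, is that an Externalizable class needs a public no-argument constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ExternalMoney extends TaggedValue implements Externalizable {
    private static final long serialVersionUID = 1;
    private int cents;

    // Externalization creates the instance through this constructor.
    public ExternalMoney() { }

    ExternalMoney(int cents, String tag) {
        this.cents = cents;
        this.tag = tag;
    }

    int cents() { return cents; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(cents); // only the local state; `tag` is not written
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException,
            ClassNotFoundException {
        cents = in.readInt();
    }

    // Writes and reads the object in memory; wraps checked exceptions.
    static ExternalMoney roundTrip(ExternalMoney original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ExternalMoney) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ExternalMoney copy = roundTrip(new ExternalMoney(1234, "audited"));
        System.out.println(copy.cents() + " " + copy.tag);
    }
}

// Hypothetical Serializable superclass whose state is skipped.
class TaggedValue implements Serializable {
    private static final long serialVersionUID = 1;
    String tag;
}
```

If the superclass state does matter, writeExternal( ) and readExternal( ) have to write and read it explicitly; nothing does so on the subclass's behalf.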