Solaris™
Failover -- Keeping Connected at All Times
Brian Gollsneider and Arthur Messenger
Success in today's high-tech world demands high-availability systems. Five
or six 9s availability requires high-end hardware, stable operating systems,
and a stable connection to the network. In Solaris™ 8, Sun introduced
IP Network Multipathing. This capability allows administrators to create a hot
standby for a network interface card (NIC) or to configure several active NICs
on a machine in a multipath group to back up each other. The hot standby can
take over for a failed primary card in as little as 100 ms. In this article,
we show how to configure a system for failover, then describe the network
impact of multipath groups, how resilient normal network applications are to
timeouts, and the system logging and notification of the relevant events.
We assume a working Solaris 8 system that is on a network and has a second
network card available. Ethereal was used to monitor and record the network activity.
Background
Network failover is the ability to recover from a network problem on one network
path and switch to another. The failure can be the network card itself dying,
the network cable being cut or disconnected, or some other equivalent event.
Note that we forced network failures by physically disconnecting the network
cable at the appropriate time. Sun's IP Network Multipathing has three main
parts: failure detection, repair detection, and outbound load spreading.
Failure detection senses when a link is no longer good; repair detection
determines when the link is good again. Outbound load spreading
divides the network traffic leaving the system among the network interfaces.
Earlier releases of Solaris supported multiple network cards but did not have
failover. Previously, the common way to simulate failover was to write a
script that continually pinged a host. If an answer was received, no action
was taken because the network connection was good. If the ping failed,
that interface was brought down and a backup interface was configured and activated.
Although this approach worked, it had limitations and was not very elegant.
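The older ping-and-switch approach can be sketched roughly as follows. This is a minimal Python illustration, not a script from any Solaris release: the function name failover_step, the injected probe_host callable, and the particular ifconfig commands returned are assumptions made for illustration, and nothing is actually executed.

```python
from typing import Callable, List


def failover_step(probe_host: Callable[[], bool], active: str,
                  backup: str, address: str) -> List[List[str]]:
    """Return the commands to run after one ping attempt.

    probe_host is a hypothetical callable that pings a well-known host
    and returns True if an answer was received.
    """
    if probe_host():
        # Answer received: the connection is good, so take no action.
        return []
    # Ping failed: bring the active interface down, then configure
    # and activate the backup with the same address.
    return [
        ["ifconfig", active, "down"],
        ["ifconfig", backup, "plumb"],
        ["ifconfig", backup, address, "up"],
    ]


# Simulate a failed ping on iprb0 with iprb1 as the backup card.
for command in failover_step(lambda: False, "iprb0", "iprb1", "10.1.1.1"):
    print(" ".join(command))
```

A single failed ping triggers the switch here, and nothing ever switches back, which hints at the limitations mentioned above.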
With Solaris 8, administrators get the ability to configure network failover
in several ways. The two primary ways are standby and active. Standby is where
the primary network card is used until a failure and then the system switches
to the standby card; active is where both cards are active until one fails,
at which time all traffic is sent through the remaining card.
Details of IP Network Multipathing
IP Network Multipathing uses a daemon, /sbin/in.mpathd, to watch over a group
of NICs. A private address used only by in.mpathd is established on each NIC.
The in.mpathd daemon sends ICMP echo requests (pings) to a target node on the
IP link (see note 1 at the end of the article).
The target is the default router if there is one. If there is no router,
the targets are found by sending a multicast packet to the "all hosts" multicast
address, 224.0.0.1. The first few hosts to reply become the targets. In our small
test network with nine other hosts, in.mpathd started sending echo requests to five
of the nine responding hosts in random order. If five consecutive
echo requests fail on a NIC in the group that in.mpathd is watching, a failure
has been detected and the link is declared not to be functioning.
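The target selection just described can be modeled in a few lines. This is a sketch of the behavior as we observed it, not in.mpathd's actual code: the function name, the list of responders (assumed to be in order of reply), and the limit of five targets (taken from our test network, not from a documented constant) are all assumptions.

```python
from typing import List, Optional

MAX_TARGETS = 5  # assumption: five targets were probed in our test network


def choose_probe_targets(default_router: Optional[str],
                         multicast_responders: List[str],
                         max_targets: int = MAX_TARGETS) -> List[str]:
    """Pick the hosts that will receive echo requests."""
    if default_router is not None:
        # A default router, if present, is the probe target.
        return [default_router]
    # Otherwise use the first few hosts that answered the probe sent
    # to the "all hosts" multicast address, 224.0.0.1.
    return multicast_responders[:max_targets]


# Nine hypothetical hosts, listed in the order they replied.
responders = ["10.1.1.%d" % n for n in range(2, 11)]
print(choose_probe_targets(None, responders))
print(choose_probe_targets("10.1.1.254", responders))
```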
The in.mpathd daemon then creates a logical interface for the failed NIC's IP
address on the NIC in the group with the fewest logical interfaces. IP will
then start using this new NIC. The in.mpathd daemon continues to send echo requests
on the failed NIC while it is declared non-functioning. When it has 10
consecutive echo request successes on the failed NIC's private address (i.e.,
it has detected the repair of the link), in.mpathd re-establishes the IP address
on the original NIC and removes the logical interface from the new NIC.
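The failure and repair rules amount to a small state machine per NIC. The Python sketch below models only the counting logic described above (five consecutive misses to fail, 10 consecutive replies to repair) and the choice of the NIC with the fewest logical interfaces. The class and function names are hypothetical, and the real daemon's internals may well differ.

```python
from dataclasses import dataclass
from typing import Dict

FAILURES_TO_FAIL = 5      # consecutive missed echo replies
SUCCESSES_TO_REPAIR = 10  # consecutive echo replies after a failure


@dataclass
class NicMonitor:
    name: str
    failed: bool = False
    misses: int = 0
    hits: int = 0

    def record_probe(self, reply_received: bool) -> str:
        """Count one probe result; return 'failover', 'failback', or 'none'."""
        if reply_received:
            self.misses = 0
            self.hits += 1
            if self.failed and self.hits >= SUCCESSES_TO_REPAIR:
                self.failed = False
                return "failback"
        else:
            self.hits = 0
            self.misses += 1
            if not self.failed and self.misses >= FAILURES_TO_FAIL:
                self.failed = True
                return "failover"
        return "none"


def pick_failover_target(logical_counts: Dict[str, int], failed: str) -> str:
    """Choose the surviving NIC with the fewest logical interfaces."""
    candidates = {n: c for n, c in logical_counts.items() if n != failed}
    return min(candidates, key=candidates.get)


monitor = NicMonitor("iprb0")
events = [monitor.record_probe(False) for _ in range(5)]
print(events[-1], pick_failover_target({"iprb0": 1, "iprb1": 0}, "iprb0"))
```

Any reply resets the miss counter and any miss resets the reply counter, so only unbroken runs count. Whether a "failback" event is acted on depends on the FAILBACK setting.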
IP is now using the original NIC. What we have called a private address is
really a deprecated IP address -- an address that IP will not use unless explicitly
told to. These private addresses must be visible to the IP link. This usually
implies that the private IP address has the same network address as the echo
request responding node. At the least, the echo request responder must have
echo response turned on in the IP stack or in.mpathd will have no way of seeing
whether the NIC is down.
If there is no router on the network, then the echo request responder must
also have the "all hosts" multicast address, 224.0.0.1, active. The file
/etc/default/mpathd is created during installation and controls several aspects
of in.mpathd's behavior, the most important of which are detection time for
a failure and whether failback is allowed. Listing 1
shows /etc/default/mpathd with the default comments removed.
FAILURE_DETECTION_TIME is set to 10000 milliseconds (10 seconds) by default.
This can be dropped as low as 100 ms for applications with time-critical connectivity needs.
This is the time to determine a NIC failure, which is defined as five consecutive
echo request failures. The system therefore divides FAILURE_DETECTION_TIME into
five approximately equal time segments to do the pings. Of course, smaller values
for FAILURE_DETECTION_TIME place a higher load on the network. FAILBACK=yes
tells in.mpathd to go back to the original NIC if it determines that it has
been repaired. We did not work with TRACK_INTERFACES_ONLY_WITH_GROUPS.
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes is the default. If this is no, in.mpathd
will report on all failed NICs on the node even if they are not in the multipath
group. The network events discussed above are logged in /var/adm/messages, so
the system activity tool of your choice (e.g., swatch) can be set up to notify
you as necessary. See the References for more detail than this article
provides.
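To make the timing arithmetic concrete, the following Python sketch parses settings in the style of /etc/default/mpathd and derives the approximate probe interval. The sample text simply restates the default values given above rather than reproducing Listing 1, and the helper names are assumptions. The division by five follows our description; the daemon itself only promises approximately equal segments.

```python
PROBES_PER_DETECTION = 5  # a failure is five consecutive missed probes

# Default values as described in the text (not a copy of Listing 1).
SAMPLE = """\
FAILURE_DETECTION_TIME=10000
FAILBACK=yes
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
"""


def parse_mpathd(text: str) -> dict:
    """Parse NAME=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def probe_interval_ms(failure_detection_time_ms: int) -> float:
    """Approximate time between echo requests on one NIC."""
    return failure_detection_time_ms / PROBES_PER_DETECTION


settings = parse_mpathd(SAMPLE)
print(probe_interval_ms(int(settings["FAILURE_DETECTION_TIME"])))  # 2000.0
print(probe_interval_ms(100))                                      # 20.0
```

At the default, each NIC sends roughly one probe every two seconds; at the 100-ms minimum, that rises to about 50 probes per second per NIC, which is where the extra network load comes from.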
Configuring a Hot Standby NIC (Command Line)
These are the steps at the command line to set up a NIC as a hot standby that
takes over network traffic if the primary card fails. Listing
2 shows the commands and standard output for the configuration.
Step 1: Check the state of the current network interface. The first ifconfig
-a (command [1]) in Listing 2 shows that interface
iprb0 is up using 10.1.1.1 as part of the 10.1.1.0 network, so we have verified
that the system is in a normal network state.
Step 2: Configure the primary card; this is shown by command [2] in Listing
2. We chose the unique address on the network of 10.1.1.200 to use as the
private address. Since this was our first command with group SERVER1, this created
the IP multipathing group, named it SERVER1 and added the NIC associated with
iprb0 to it. It also started the in.mpathd daemon. The addif 10.1.1.200 netmask
255.255.255.0 broadcast 10.1.1.255 clause added a logical interface, iprb0:1, to the
NIC. The -failover option marks 10.1.1.200 as a non-failover address. That is,
in.mpathd will not make a logical interface for it on another NIC if this NIC
should fail. The option deprecated marks 10.1.1.200 as not being available
as a source address for outbound packets unless explicitly asked for (bound).
Finally, up enables the logical interface just created. The second ifconfig
-a (command [3]) shows the successful completion of the command with the
new interface iprb0:1. Notice that the logical interface iprb0:1 is DEPRECATED
and NOFAILOVER.
Step 3: Configure the hot standby card; this is shown by command [4] in Listing
2. We chose 10.1.1.201 for this address. The plumb option sets up the
connections between the device driver and the NIC; 10.1.1.201 netmask 255.255.255.0
broadcast 10.1.1.255 sets up the IP address for this NIC. The address is marked
deprecated, a member of the group SERVER1, and not to fail over if the NIC fails.
The standby option marks this NIC as a hot backup for a failed NIC in
the group SERVER1. Finally, the up option at the end enables the interface.
The final ifconfig -a (command [5]) shows iprb1 successfully configured
to be a hot standby. Later, we will describe some testing to determine how failover
performs.
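For readers without Listing 2 at hand, the two configuration commands can be reconstructed from the option-by-option description in Steps 2 and 3. The Python sketch below only assembles and prints the command lines; it is a reconstruction under that assumption, not a copy of the listing, and the helper names are hypothetical. The ifconfig keywords themselves (group, addif, -failover, deprecated, plumb, standby, up) are the ones named in the steps.

```python
from typing import List

GROUP = "SERVER1"
NETMASK = "255.255.255.0"
BROADCAST = "10.1.1.255"


def primary_command(nic: str, test_addr: str) -> List[str]:
    """Step 2: join the group and add a deprecated, non-failover test address."""
    return ["ifconfig", nic, "group", GROUP,
            "addif", test_addr, "netmask", NETMASK, "broadcast", BROADCAST,
            "-failover", "deprecated", "up"]


def standby_command(nic: str, test_addr: str) -> List[str]:
    """Step 3: plumb the second NIC and mark it as a hot standby."""
    return ["ifconfig", nic, "plumb", test_addr,
            "netmask", NETMASK, "broadcast", BROADCAST,
            "deprecated", "group", GROUP, "-failover", "standby", "up"]


for argv in (primary_command("iprb0", "10.1.1.200"),
             standby_command("iprb1", "10.1.1.201")):
    print(" ".join(argv))
```

Keeping the netmask, broadcast address, and group name in one place makes slips such as a mistyped broadcast address less likely.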
Configuring a Hot Standby NIC (Startup Scripts)
Next, we briefly show the equivalent syntax to preserve failover across
reboots. The idea is the same, but the syntax differs
considerably. Initializing a NIC is a relatively complex activity controlled by the
startup script /etc/init.d/network. This script uses the file /etc/hostname.NICdriver
to determine whether a NIC is to be initialized. Normally, this contains only
the hostname associated with the NIC. This file becomes greatly expanded if
you use multipathing failover. Listing 3 shows first
the contents of /etc/hostname.iprb0, the primary NIC, and then /etc/hostname.iprb1,
the hot standby. Again, we don't need to start in.mpathd manually. It is
started by the use of group in the configuration files. One caution on the importance
of proper syntax: on the 10.1.1.11 address, we left out the "up" the first
time. This broke DNS, which reported an error about
not being able to find the hostname of the DNS server.
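As a rough guide to what such files look like, the Python sketch below renders hostname file contents in the general form used for multipathing and checks that every address clause ends with "up", the omission that tripped us. It is not Listing 3: the addresses are the ones from the command-line steps, the "netmask + broadcast +" shorthand and the clause order are assumptions about the file format, and the helper names are hypothetical.

```python
from typing import List


def hostname_primary(addr: str, test_addr: str, group: str) -> str:
    """Contents for the primary NIC's /etc/hostname.* file (assumed form)."""
    return (f"{addr} netmask + broadcast + group {group} up "
            f"addif {test_addr} deprecated -failover netmask + broadcast + up")


def hostname_standby(test_addr: str, group: str) -> str:
    """Contents for the hot standby's /etc/hostname.* file (assumed form)."""
    return (f"{test_addr} netmask + broadcast + deprecated "
            f"group {group} -failover standby up")


def clauses_missing_up(contents: str) -> List[str]:
    """Return the address clauses that do not end with 'up'."""
    clauses, current = [], []
    for token in contents.replace("\\\n", " ").split():
        if token == "addif" and current:
            clauses.append(current)  # a new address clause starts here
            current = []
        current.append(token)
    if current:
        clauses.append(current)
    return [" ".join(c) for c in clauses if c[-1] != "up"]


primary = hostname_primary("10.1.1.1", "10.1.1.200", "SERVER1")
print(primary)
print(hostname_standby("10.1.1.201", "SERVER1"))
print(clauses_missing_up(primary))
```

Running a check like this before a reboot is cheaper than discovering the missing keyword through a failed name lookup.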
IP Network Multipathing Testing
By following the above steps, we have configured a primary network card and
a hot standby. For initial testing, we changed the FAILURE_DETECTION_TIME to
the minimum 100 ms in /etc/default/mpathd. Listing 4
shows an extract of the system event logging in /var/adm/messages as we put
traffic on the network and forced network failures and repairs by pulling network
cables. Note that as the system becomes loaded, it cannot keep a failure detection
time of 100 ms and reports what it actually can do. Also, note the various messages
about NIC failures, successful failover to iprb1, and repair detection and failback
to iprb0. Our conclusion is that failover works as advertised although very
short failure detection times may not be supportable under load.
The other part of our testing focused on the connection impact of various
FAILURE_DETECTION_TIMEs. Small values of the parameter add greater network traffic
in the form of more heartbeat pings and might not even be supportable because
of system load. We therefore put the parameter back to its default 10000-ms
value and checked the normal UNIX connection utilities (telnet, ftp, ssh) for
the impact of failover. We found that no utility would time out with a value
of 10 seconds. We were able to repeatedly pull cables, going back and forth
between the cards, and never encountered a failure. We transferred a 100-MB
file with ftp through four failovers back and forth without any problems. We
conclude that the default 10-second value for FAILURE_DETECTION_TIME is adequate
for many applications but that each application will need to be tested.
Configuring Multiple NICs (Startup Scripts)
Configuring the system for two active cards is very similar to the standby
setup. Listing 5 shows the /etc/hostname.iprb1 setup.
It is now a clone of /etc/hostname.iprb0 with different IP values. The /etc/hostname.iprb0
file has no changes. With this setup, the system can fail over and recover in
either direction, and it has the advantage of greater throughput because of
the second active NIC.
Conclusions
We found that Sun's IP Network Multipathing facility provides hot-standby
protection against NIC and network hardware failures. It can be configured as a primary
NIC with a hot standby or as two primary NICs, each failing over to the other as necessary.
This capability should be useful to many people, especially for systems
with very high-availability requirements. It is very easy to configure your
system to fail over as required. The default timeout value of 10 seconds is adequate
for many applications, but the required timeout value for each application will
have to be determined by testing. Very small timeout values may not be supportable by a
loaded system. System events are logged in /var/adm/messages.
References
IP Network Multipathing Administration Guide, Part 806-7931-10, Sun Microsystems,
Inc., Palo Alto, CA 94303-4900, April 2001.
Man page -- man in.mpathd
1. An IP link is a communication facility or medium over which nodes can communicate
at the link layer. This is the subnetwork access layer of the TCP/IP network
model, or layers 1 and 2 of the OSI model. Think of a LAN, a switch, or a 10Base5 (thicknet)
cable.
Brian Gollsneider is working on a PhD in Electrical Engineering from the
University of Maryland. When not buried in research, he is a UNIX instructor
for Learning Tree International. He can be reached at: gollsneb@glue.umd.edu.
Arthur M. Messenger is a retired UNIX systems administrator who occasionally
answers questions for friends and works part time for Learning Tree International.
When not teaching, he lives with his wife in Haymarket, Virginia where they
spend time with their grandchildren. He can be reached at: Arthur.Messenger@att.net.