Coordinating
Multi-Host Backups Using Perl Sockets
A. Clay Stephenson
Systems administrators are often tasked with cleaning up someone else's mess.
I encountered such a situation when I started working with my current employer.
The company's newly installed Enterprise Resource Planning (ERP) system needed
a backup, and the system integrator had configured an Oracle "hot" backup as
the perfect solution. After looking at the situation, I found a serious problem
with this "perfect" solution -- the backups were essentially useless. The problem
was that this ERP system utilized a split architecture -- meaning that the application
servers hosted metadata that described the actual database data hosted by additional
servers. Unless the metadata exactly matched the described data, the backups,
no matter how reliable, were useless.
A method was needed to halt or freeze the application servers and to then
halt the databases. Then, the backups would be taken and, finally, databases
and applications would be restarted in reverse order. At this point, the system
integrator's answer was to shut down everything and do a traditional "cold"
backup -- not very practical for a production environment.
An alternative to these long downtimes was quite simple -- snapshot mounts
of the filesystems. The backups could then be made at leisure using these snapshots.
The entire system would only need to be down briefly (in practice less than
two minutes per day, which was deemed acceptable). However, I still had to address
the problem of coordinating the shutdown and restart of this group of processes
on multiple hosts in the correct order.
In an ideal world, nothing more complicated than the following would be required
of a simplified system consisting of two ERP application servers connecting
to a database on another dedicated server:
1. Shut down the ERP servers and wait until all background processes finish.
remsh erp_server1 shutdown_erp.sh &
remsh erp_server2 shutdown_erp.sh &
wait
2. Create filesystem snapshots or split mirrors and wait for these tasks to complete.
remsh erp_server1 create_snapshots.sh &
remsh erp_server2 create_snapshots.sh &
wait
3. Shut down the database, snapshot the filesystems, and restart the database.
remsh db_server shutdown_db.sh
remsh db_server create_snapshots.sh
remsh db_server startup_db.sh
4. Restart the ERP applications and wait until all are restarted.
remsh erp_server1 start_erp.sh &
remsh erp_server2 start_erp.sh &
wait
At this point, all applications are available, and we can safely begin the backup
using the snapshots. This simple solution should work quite well, but in practice,
the remsh's often fail to finish because of the way some of the startup
daemons are written. Typically, the remote remsh daemons (remshd)
never terminate, so the "wait" statements are never satisfied.
One answer to synchronizing these tasks would be to have all the backup pre-
and post-exec scripts write data to a network file, but I didn't want to deal
with file locking and cleaning up old files. A second approach would have been
to use a commercial product like BMC Software's "CONTROL-M", but I realized
that this was the perfect time to hone my Perl skills to fashion a sockets-based
client-server pair that would essentially create a set of semaphores accessible
by multiple hosts.
All that really needed to happen was that each backup task would wait until
a semaphore reached a given value before proceeding. For example, the pre-exec
shell script to shut down the database would use the semaphore client to check
on the status of the ERP semaphore before proceeding. When the ERP application
pre-exec script finished shutting down the application, it would set the ERP
semaphore and then loop, waiting for a semaphore indicating that the database
has been restarted. As is typical of most commercial backup packages, HP's OmniBack
II (now called DataProtector) allows one to write pre-exec scripts that are
executed at the start of the entire multi-host backup as well as pre-exec scripts,
which execute on each host. Homegrown solutions can use cron and remsh's
to do much the same thing.
I actually starting writing the pseudo-code scripts to utilize the semaphore
client/server pair before tackling the problem of coding the semaphore applications
themselves. I realized that I needed a few primitive request codes for each
semaphore set:
SET -- Set the value of a semaphore.
GET -- Get the current value of a semaphore.
INC -- Increment the value of a semaphore.
DEC -- Decrement the value of a semaphore.
SET_LIMIT -- Set the limit of a semaphore.
GET_LIMIT -- Get the limit of a semaphore.
In essence, each semaphore would have both a "value" and a "limit", which
could be manipulated by the client. I used case-insensitive string "tags" to
identify each semaphore (e.g., "Resource_1", "Resource_2", etc.). The request_codes
would also be case-insensitive strings so that "Set", "SET", and "set" would
all be recognized as valid arguments. The basic command-line structure of the
client-side began to take shape. Each request would need three arguments (besides
optionally identifying the server host and port) and would take this form:
RSLT=$(client.pl tag request_code value)
STATUS=${?}
Here is a more concrete example assuming that the current value of Resource_1
is 2:
RSLT=$(client.pl -h remotehost -P 7777 -t 5 Resource_1 INC 1)
STATUS=${?}
This would make a request of the semaphore server running on host remotehost,
using port 7777 with a timeout of 5 seconds. It would increment the current value
(2) by 1 and return a new value of 3 in ${RSLT} and set an exit status of 0 indicating
success. The tag values (Resource_1 through Resource_8) could mean anything you
like. For example, Resource_2 might be used to indicate the status of a database
instance.
As part of the design, I decided that all of the values and limits would be
initialized at 0 when the server component was started. I also chose to shut
down the server component when it received a SIGTERM or when a client sent a
special "STOP" request code. It was my intention to hide all the details of
setting up bi-directional sockets, host and port name lookups, dealing with
timeouts, and argument parsing so that all that was needed to manipulate the
Perl black-boxes was a bit of shell scripting ability.
To get a feel for how these routines work, I suggest loading both server.pl
(Listing 1) and client.pl (Listing
2) on one host (named "boss" below) and just the client.pl script on another
host. Perl version 5.005 or later should be installed. If you are running them
on the same host, then:
$ server.pl -P 7777 # or any other available tcp port
Now, let's use the client to increment the initial value, 0:
$ client.pl -P 7777 Resource_1 INC 1
1
Repeat the command and note that the value increases to 2:
$ client.pl -P 7777 Resource_1 INC 1
2
Decrement the value by 1:
$ client.pl -P 7777 Resource_1 DEC 1
1
If you now execute the client on a different host, you will see the real power
of this approach. Note the additional -h argument to designate the server
hostname. On a different host:
$ client.pl -h boss -P 7777 Resource_1 INC 1
2
Finally, to shut down the server process:
$ client.pl -h boss -P 7777 Resource_1 STOP 0
You can invoke both server.pl and client.pl with a -u argument for full
usage.
The sequence to synchronize the shutdown of two ERP applications servers and
a database server and then restart them in reverse order now becomes:
"Resource_1" will be the tag that is associated with the ERP application.
"Resource_2" will be associated with the database.
#1) This is run on a server called 'boss'; it is the first
# activity started. Start the semaphore server; use the tcp port 7777.
server.pl -P 7777
STAT=${?}
exit ${STAT}
#2) Code to be run on both of the ERP application servers.
# Shutdown the ERP application and increment the 'Resource_1' semaphore.
shutdown_erp.sh
STAT=${?}
if [[ ${STAT} -eq 0 ]]
then
RSLT=$(client.pl -h boss -P 7777 Resource_1 INC 1)
STAT=${?}
# create the snapshot mounts of the ERP filesystems
create_snapshots.sh
fi
# Now loop until the database is back up; indicated by Resource_2 set to > 0.
RSLT=$(client.pl -h boss -P 7777 Resource_2 GET 0)
STAT=${?}
while [[ ${STAT} -eq 0 && ${RSLT} -eq 0 ]]
do
sleep 10
RSLT=$(client.pl -h boss -P 7777 Resource_2 GET 0)
STAT=${?}
done
# Now we can restart the ERP applications.
if [[ ${STAT} -eq 0 ]]
then
start_erp.sh
STAT=${?}
fi
# At this point, the ERP application is restarted and snapshots have been made.
exit ${STAT}
#3) Code to be executed on the database server; it is started at
# the same time as the #2) above.
# Loop until ERP applications are down indicated by Resource_1 being set to 2.
RSLT=$(client.pl -h boss -P 7777 Resource_1 GET 0)
STAT=${?}
while [[ ${STAT} -eq 0 && ${RSLT} -lt 2 ]]
do
sleep 10
RSLT=$(client.pl -h boss -P 7777 Resource_1 GET 0)
STAT=${?}
done
if [[ ${STAT} -eq 0 ]]
then #shutdown database and snapshot
shutdown_db.sh
create_db_snapshots.sh
startup_db.sh
STAT=${?}
if [[ ${STAT} -eq 0 ]]
then # set Resource_2 semaphore to 1
RSLT=$(client.pl -h boss -P 7777 Resource_2 SET 1)
STAT=${?}
fi
fi
exit ${STAT}
At this point, all backups can be made safely using the snapshots. When completed,
the snapshots can be removed, and a final call to the semaphore server can be
made to shut it down.
RSLT=$(client.pl -h boss -P 7777 Resource_1 STOP 0)
If you are interested in the details of the Perl semaphore server and client,
examine Listings 1 and 2.
The key feature of the server is that it is intentionally single-threaded and
thus the operations are atomic. By default, a named tcp port "omnisync" is used;
an entry in the services file or map can be made for it. With very small changes
(Windows lacks the alarm() system call), the client can also be executed
on a Windows platform after installing one of the freely available Perl implementations
for Windows. The minor changes necessary are indicated in the client code. It
thus becomes possible to coordinate tasks not only among UNIX hosts but also among
Windows platforms.
The examples shown in this article demonstrate how to coordinate a backup
among multiple hosts, but there are potentially many uses for these tools. I
was pleased with the ease with which Perl handles sockets. It was remarkably
similar to the way that I would have done this in C or C++.
A. Clay Stephenson has been a UNIX developer and systems administrator
for more than 20 years. With a background in Physics, he has been employed in
the aerospace and chemical manufacturing industries for several years. He is
the current leading contributor to HP's ITRC Forums and can be reached at: cstephen@chemfirst.com.