Fast Sockets, An Interprocess Communication Library
By Nagendra Nagarajayya and S.R. Venkataramanan, February 2001
Introduction
Interprocess communication (IPC) is the exchange of data between
two or more processes on the same machine. There are various forms
of IPC, such as sockets, memory mapped files, shared memory, pipes,
message queues, and Solaris doors. All of these have the objective of
moving data from one address space to another. Ideally, data would be
moved with bcopy directly from one address space to another, without
trapping into the kernel. This is not possible, because one process cannot
address another process's memory directly.
An alternative is the Fast Sockets technology, which uses the Speed library.
Fast Sockets is not a new form of IPC, but an implementation that uses
the Solaris interposition technique to dynamically overlay INET-TCP
sockets. The Speed library uses a combination of doors IPC and memory mapping to
emulate TCP sockets. The Speed design is based on the principle that
minimizing system time translates directly to a gain in application
performance. The Speed library minimizes system time by using a shared memory map
and using bcopy to read and write from this shared space.
The fastest IPC on Solaris is doors [2], a newer API that was
first available in Solaris 2.6. Applications that want to
communicate using doors need to be explicitly programmed to do so.
Even though doors IPC is very fast, socket-based IPC is more popular,
since it is portable, flexible, and can be used to communicate across
a network. Socket IPC on Solaris software is quite fast and makes use of the loopback interface to move data. But socket IPC has protocol overhead,
the connection set up time is high, and it may be a little heavy-weight
for a fast IPC. Sockets are implemented in the Solaris kernel,
and applications using sockets transfer data using the system calls
read and write. These calls make use of
the kernel to move data by transferring it from the user space to
the kernel, and from the kernel to the user space, thus incurring system time.
Though this kernel dependency is necessary for applications communicating
across a network, it impacts system performance when used for communication
on the same machine. The Speed library uses bcopy and memory mapping
to move data, incurring minimal system time and more user time. This
reduced system time translates to a better elapsed time and frees the
system for other tasks.
The Speed library was designed to overcome the issues of doors
and socket-based IPC. The Speed library has no protocol overhead, the connection
set up time is low, connections are pooled automatically, and read
and write calls are converted into bcopy calls.
bcopy is a user-level call, and kernel usage (system time) is
limited to signaling data availability. Test data shows that
elapsed time with the Speed library is in fact the sum of bcopy times plus
a very small amount of system time, thus approaching the ideal
way to move data between two processes.
Three different implementations of the Speed library are discussed in this article:
- Using doors IPC to transfer data and signal the other process of data
availability
- Using memory maps to transfer data while using doors IPC
to signal data availability
- Using doors IPC to set up the initial
connection and memory mapping to transfer data, while using
semaphores to signal the other process of data availability
The performance of the three implementations is compared with each
other and against TCP sockets. Using TNF instrumentation, the
latency of a read and a write call on
the client and the server was measured. These measurements are compared to explore
the relationships between context switching times and performance.
Basic Concept: Interposition of Shared Objects
Solaris dynamic libraries allow a symbol to be interposed: if more than
one definition of a symbol exists, the first definition found takes precedence
over all others. The environment variable LD_PRELOAD can be used
to load shared objects before any other dependencies are loaded. The Speed library
uses this mechanism to interpose the socket, accept, bind, connect, read,
write, close, and thread_create symbols. Code Example 1 shows the technique applied to bind.
Speed interposition is needed on both the server and the client
applications. This interposition allows existing client-server applications
to transparently use the library. On the server side, LD_PRELOAD is used to
load the shared library libspeedup_server.so, and on the client side LD_PRELOAD
is used to load libspeedup_client.so.
Speed Implementation with Solaris Doors - Implementation I
This implementation of the Speed library is very simple and straightforward.
On the client side, the connect, read, write, close, and thread_create
symbols are interposed on the TCP socket symbols, while on the server side,
the bind, accept, read, write, and close symbols are interposed. Even though
the symbols are interposed, the TCP socket client-server semantics are not
changed. The server establishes a server socket and listens on this socket.
The client connects to this port to establish a connection, and starts reading
and writing information as usual. But instead of flowing through the socket and
the kernel, the data is transferred using the doors IPC.
Because threads in two different processes need to be synchronized to send
and receive data, a producer/consumer paradigm is used to transfer data. In
transferring data from the client to the server, a write operation by the
client is a read operation in the server. In other words, the client becomes
the producer and the server becomes the consumer. The roles are reversed
when transferring data from the server to the client, in which case the server
becomes the producer and the client becomes the consumer.
The functionality is described in detail in the sections that follow.
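The producer/consumer handoff described above can be sketched in portable C. This is only an illustration, not the Speed library's code: it assumes POSIX semaphores (sem_wait, sem_post) in place of the Solaris sema_wait and sema_post calls, memcpy in place of bcopy, and a hypothetical slot_t type with an assumed 8192-byte buffer.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 8192   /* assumed buffer size, for illustration only */

/* One direction of a connection: a buffer guarded by two semaphores. */
typedef struct {
    sem_t  empty;      /* posted when the buffer may be overwritten */
    sem_t  occupied;   /* posted when the buffer holds unread data  */
    size_t len;        /* number of valid bytes in buf              */
    char   buf[SLOT_SIZE];
} slot_t;

/* Wait on a semaphore, retrying if a signal interrupts the wait. */
static void wait_sem(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR)
        ;
}

/* Start with the buffer empty: one "empty" token, no "occupied" token. */
int slot_init(slot_t *s)
{
    s->len = 0;
    if (sem_init(&s->empty, 0, 1) != 0)
        return -1;
    return sem_init(&s->occupied, 0, 0);
}

/* Producer: wait for space, copy the data in, signal the consumer. */
size_t slot_put(slot_t *s, const void *data, size_t n)
{
    assert(n <= SLOT_SIZE);
    wait_sem(&s->empty);
    memcpy(s->buf, data, n);
    s->len = n;
    sem_post(&s->occupied);
    return n;
}

/* Consumer: wait for data, copy it out, signal the producer.
 * Bytes beyond max are dropped in this sketch. */
size_t slot_get(slot_t *s, void *out, size_t max)
{
    size_t n;

    wait_sem(&s->occupied);
    n = s->len < max ? s->len : max;
    memcpy(out, s->buf, n);
    sem_post(&s->empty);
    return n;
}
```

When a client writes, the client side plays the role of slot_put and the server read plays slot_get; the roles swap for data flowing from server to client. Whichever side arrives first sleeps until the other posts the semaphore it is waiting on.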
Server Side Functions
a. bind
Once a server socket has been created, it is named with a call
to the bind function. Since libspeedup_server.so is
interposed on the server side, the Speed library bind function,
shown in the following code example, is called first. The Speed library establishes the doors service and
then calls the original socket bind.
Code Example 1. Server bind function in Implementation I
int bind(int s, const struct sockaddr *addr, socklen_t addrlen)
{
    int did;
    pid_t id;
    int dfd, dfd1;
    int mask;
    static int (*fptr)() = 0;   /* address of the real bind, resolved once */
    char buffer[50];
    char *bptr = buffer;

    if (fptr == 0) {
        struct sockaddr_in *ad = (struct sockaddr_in*)addr;

        /* Step 1: create the door service and attach it to a file name */
        if ((did = door_create(server, DOOR_COOKIE, DOOR_UNREF)) < 0) {
            perror("door_create");
            return -1;
        }
        unlink(NAME_SERVICE_DOOR);
        mask = umask(0);
        dfd = open(NAME_SERVICE_DOOR, O_RDONLY|O_CREAT|O_EXCL|O_TRUNC, 0644);
        umask(mask);
        if (fattach(did, NAME_SERVICE_DOOR) < 0) {
            perror("fattach");
            return -1;
        }
        /* Step 2: look up the real bind in libsocket.so */
        fptr = (int (*)())dlsym(RTLD_NEXT, "bind");
        if (fptr == NULL) {
            (void) printf("dlsym: %s\n", dlerror());
            return -1;
        }
    }
    /* Step 3: chain to the real bind */
    return ((*fptr)(s, addr, addrlen));
}
Step 1: The bind function establishes a door service.
Step 2: A dlsym lookup obtains the address of the actual bind function in libsocket.so. The address is stored in the static variable fptr and is used to chain to the actual bind.
Step 3: The libsocket.so bind is called to establish the name.
b. accept
The accept function is used to accept an incoming client connection request.
On a connect request, Speed accept is called since it interposes the libsocket.so's accept.
The Speed library first calls the libsocket.so accept to actually create a TCP connection. Once a successful connection
has been established, the socket descriptor is stored in a Speed data structure, and synchronization variables are initialized.
c. read
The read on the server side is a consumer of the client-write data. When
the server tries to read data on a file descriptor, the Speed library read
function is called because it is interposed. A check is made to see whether the
file descriptor matches the established file descriptor and, if so, the read
function waits on the semaphore occupied_r. For all other file descriptors, the
Speed library function transfers control to the libc.so read.
When the client writes some data, the data is transferred using doors
IPC into the server process. The doors service in the server process
identifies whether the operation is a read or write. A sema_post operation
is executed on the occupied_r semaphore, and a sema_wait is executed on
empty_r. The sema_post wakes up the read thread. The read thread copies the data using bcopy and wakes up the door service thread using sema_post on the empty_r semaphore.
Code Example 2. Server read function in Implementation I
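void door_service(void *cookie, char *argp, size_t arg_size,
                  door_desc_t *dp, uint_t n_descriptors)
{
    ...
    } else if (ptr->type == WRITE) {
        /* Step 1: find the connection and wait for space in its read buffer */
        fd = ports[ptr->port];
        SEMA_WAIT(&pmap[fd].empty_r);
        /* Step 2: copy from the door buffer, then wake the server read */
        bcopy(ptr->buf, pmap[fd].rbuf, ptr->size);
        sema_post(&pmap[fd].occupied_r);
    }
    ...
}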
Step 1
a. Make sure that the operation is a client write.
b. Copy the port you are communicating on. The client sends it through the door call.
c. Execute a sema_wait to see if there is space in the Speed data buffer (producer).
Step 2
a. bcopy from the door buffer to the Speed data buffer.
b. Execute sema_post to signal to the Speed server side read that data is available.
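ssize_t read(int fildes, void *buf, size_t nbyte)
{
    ...
    /* Step 1: is this a descriptor managed by the Speed library? */
    if (fildes > 0 && pmap[fildes].fd == fildes) {
        /* Step 2: wait for data, copy it out, signal the producer */
        SEMA_WAIT(&pmap[fildes].occupied_r);
        bcopy(pmap[fildes].rbuf, buf, nbyte);
        sema_post(&pmap[fildes].empty_r);
        return nbyte;
    }
    ...
}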
Step 1
Check for a valid file descriptor and see whether it exists in the Speed data structure, i.e., whether it is a valid socket descriptor.
Step 2
a. Execute sema_wait to see if data exists in the Speed data buffer to read (consumer).
b. If successful, use bcopy to copy the data from the Speed data structure to the application buffer.
c. Execute sema_post to signal the producer (client) that data has been read.
d. write
The write function on the server side is a producer of
the client-read data. When the server tries to write data
on a file descriptor, the Speed function write is called
since it is interposed. A check is made to see whether the
file descriptor matches the established connection file descriptor,
and, if so, the write function waits on the semaphore empty_w.
If successful, the data is copied to the Speed buffer and
sema_post is executed on occupied_w.
When the client tries to read data, a fast context switch is done into the server
process using doors IPC. The doors service in the server process identifies if the
operation is a read or write, and a sema_wait operation is executed
on the occupied_w semaphore. The sema_post on
occupied_w by the write thread wakes up the door_service
thread, and the data in the Speed buffer is transferred to the client-read buffer.
A sema_post is executed on the empty_w semaphore.
Code Example 3. Server write function in Implementation I
ssize_t write(int fildes, const void *buf, size_t nbyte)
{
    ...
    if (fildes > 0 && pmap[fildes].fd == fildes) {
        /* Step 1: wait for space in the write buffer */
        SEMA_WAIT(&pmap[fildes].empty_w);
        /* Step 2: copy the data in and signal door_service */
        bcopy(buf, pmap[fildes].wbuf, nbyte);
        sema_post(&pmap[fildes].occupied_w);
        return nbyte;
    }
    ...
}
Step 1
a. Execute a sema_wait to see whether there is space in the Speed data buffer (producer).
Step 2
a. bcopy from the application buffer to the Speed data buffer.
b. Execute sema_post to signal door_service that data is available.
void door_service(void *cookie, char *argp, size_t arg_size,
                  door_desc_t *dp, uint_t n_descriptors)
{
    ...
    if (ptr->type == READ) {
        /* Step 1: find the connection and wait for server-write data */
        fd = ports[ptr->port];
        SEMA_WAIT(&pmap[fd].occupied_w);
        /* Step 2: copy into the door buffer, free the write buffer, return */
        bcopy(pmap[fd].wbuf, ptr->buf, ptr->size);
        sema_post(&pmap[fd].empty_w);
        door_return((char*)ptr->buf, ptr->size, NULL, 0);
    }
    ...
}
Step 1
a. The client should have executed a read call asking for data; a door_call is executed for a fast context switch to the server address space.
b. Execute a sema_wait to see whether there is data in the Speed data buffer.
Step 2
a. If successful, bcopy the data from the Speed data structure to the door buffer.
b. Execute sema_post to signal the producer (server) that data has been read.
Client Side Functions
a. connect
The client establishes a connection to the server using the connect
function. Since the connect symbol is interposed, the Speed version of
connect gets control. A connection is established to the server using
the doors IPC. The libsocket.so connect is called to establish a real
connection. If the connection is successful, the necessary Speed data
structures are created.
Code Example 4. Client connect Function in Implementation I
int connect(int s, const struct sockaddr *addr, socklen_t addrlen)
{
    ...
    /* 1: first call only: open the server's door and resolve the real connect */
    if (fptr == 0) {
        if ((door_fd = open(NAME_SERVICE_DOOR, O_RDONLY)) < 0) {
            perror("Open bogus");
            exit(1);
        }
        info.di_target = 0;
        if (door_info(door_fd, &info) < 0) {
            perror("Door_info");
            printf("errno=%d\n", errno);
            exit(1);
        }
        fptr = (int (*)())dlsym(RTLD_NEXT, "connect");
        if (fptr == NULL) {
            (void) printf("dlsym: %s\n", dlerror());
            return -1;
        }
    }
    /* 2: record the descriptor and make the real connection */
    dinfo[s].fd = s;
    ret = ((*fptr)(s, addr, addrlen));
    /* 3: on success, remember the local port that identifies this connection */
    if (ret != -1) {
        slen = sizeof(client);
        getsockname(s, (struct sockaddr *)&client, &slen);
        dinfo[s].port = client.sin_port;
    }
    return ret;
}
b. read
When the client calls read to get data from the server,
the Speed version of read is called. A check is made to
ensure that the file descriptor matches the established connection,
and a fast context switch is made into the server door_service.
On return, the server data is copied into the client buffer using bcopy.
Code Example 5. Client read Function in Implementation I
ssize_t read(int fildes, void *buf, size_t nbyte)
{
    ...
    /* 1: build the door argument describing the read request */
    if (dinfo[fildes].fd > 0 && dinfo[fildes].fd == fildes) {
        dinfo[fildes].read.fd = dinfo[fildes].fd;
        dinfo[fildes].read.port = dinfo[fildes].port;
        dinfo[fildes].read.buf[0] = '\0';
        dinfo[fildes].read.size = nbyte;
        dinfo[fildes].read.type = READ;
        dinfo[fildes].darg_r.data_ptr = (char*)&dinfo[fildes].read;
        dinfo[fildes].darg_r.data_size = PADSIZE + nbyte + 1;
        dinfo[fildes].darg_r.desc_ptr = NULL;
        dinfo[fildes].darg_r.desc_num = 0;
        dinfo[fildes].darg_r.rbuf = (char*)dinfo[fildes].read.buf;
        dinfo[fildes].darg_r.rsize = nbyte;
        door_call(door_fd, &dinfo[fildes].darg_r);
        /* 2: copy the returned server data into the caller's buffer */
        bcopy(dinfo[fildes].read.buf, buf, nbyte);
        return nbyte;
    }
    ...
}
c. write
When the client calls write to send data to the server, the Speed
write is called. A check is made to ensure that the file descriptor
matches the established connection. If so, the write data is bcopied
to a Speed buffer and a fast context switch is made into the server
door_service to wake up the waiting server read thread.
Code Example 6. Client write Function in Implementation I
ssize_t write(int fildes, const void *buf, size_t nbyte)
{
    ...
    /* 1: copy the data into the Speed buffer and describe the write request */
    if (dinfo[fildes].fd > 0 && dinfo[fildes].fd == fildes) {
        bcopy(buf, dinfo[fildes].write.buf, nbyte);
        dinfo[fildes].write.fd = dinfo[fildes].fd;
        dinfo[fildes].write.port = dinfo[fildes].port;
        dinfo[fildes].write.size = nbyte;
        dinfo[fildes].write.type = WRITE;
        dinfo[fildes].darg_w.data_ptr = (char *)&dinfo[fildes].write;
        dinfo[fildes].darg_w.data_size = PADSIZE + nbyte + 1;
        dinfo[fildes].darg_w.desc_ptr = NULL;
        dinfo[fildes].darg_w.desc_num = 0;
        dinfo[fildes].darg_w.rbuf = (char*)dinfo[fildes].write.buf;
        dinfo[fildes].darg_w.rsize = nbyte;
        door_call(door_fd, &dinfo[fildes].darg_w);
        /* 2: the door call has handed the data to the server */
        return nbyte;
    }
    ...
}
Speed Implementation with Solaris Doors and Memory Map - Implementation II
The implementation of the Speed library with doors and memory maps is more
complex, as data is copied into a memory mapped (mmap(2)) buffer
to avoid making multiple copies of the data. For this implementation, a
sliding window type of buffer management has been adopted. For every connection, the
server creates a shared memory mapped segment. This segment is
divided into multiple windows. Each window is further divided
into slots and the number of slots and the slot sizes are configurable.
The libsocket.so accept is no longer called for loopback connections,
but it is simulated. However, libsocket.so accept is called for
connections coming across the network. The connections are automatically
pooled. This was done to re-use the memory map segments instead of creating
them for every connection. The server caches the connection and, if a client
re-connects, a connection is returned from the pool.
Data is now directly bcopied into an available slot in the
memory mapped segment. The doors IPC is used only to make a fast
context switch into the server process. This makes doors extremely
lightweight, resulting in very fast context switch times. The data
consumption is still based on the producer/consumer model. The producer
now has more slots to copy the data, as the memory mapped segment is divided
into windows and slots.
Server Side Functions
a. bind
The bind function operates as in implementation I, creating a new
door service. It initializes buffer management variables and calls the
libsocket.so bind to bind the name.
Code Example 7. Server bind Function in Implementation II
int bind(int s, const struct sockaddr *addr, socklen_t addrlen)
{
...
/* 1: first call only: create the door service for this port */
if (fptr == 0) {
cptr = ( struct sockaddr_in*) addr;
if ((did = door_create(server, DOOR_COOKIE, DOOR_UNREF)) < 0) {
perror("door_create");
return -1;
}
sprintf(bptr, "%s%d", NAME_SERVICE_DOOR, cptr->sin_port);
unlink(bptr);
mask = umask(0);
dfd = open(bptr, O_RDONLY|O_CREAT|O_EXCL|O_TRUNC, 0644);
umask(mask);
if (fattach(did, bptr) < 0 ) {
perror("fattach");
return -1;
}
/* 2: initialize connection bookkeeping and buffer management parameters */
accept_block = FALSE;
if (getenv("SPEED_ACCEPT_BLOCK") != 0)
accept_block = TRUE;
mutex_init(&connect_m, USYNC_THREAD, NULL);
mutex_init(&used_doors.access, USYNC_THREAD, NULL);
used_doors.front = MAX_FDS;
used_doors.number = 0;
mutex_init(&open_doors.access, USYNC_THREAD, NULL);
open_doors.index = 0;
open_doors.open = 0;
/* BUFSIZE = 8192, 8192 / 2 for r, and w, /winsz for number
of wins */
bptr = (char*)getenv("SPEED_NOWINS");
if (bptr == NULL)
tparams.nowins = NOWINS;
else
tparams.nowins = atoi(bptr);
if (tparams.nowins <= 0)
tparams.nowins = NOWINS;
if ((bptr = (char*)getenv("SPEED_WINSIZE")) == (char*)NULL)
tparams.winsz = BUFSIZE/4;
else {
tparams.winsz = atoi(bptr);
}
if (tparams.winsz <= 0)
tparams.winsz = BUFSIZE/4;
tparams.bufsize = tparams.winsz * tparams.nowins * FULL_DUPLEX;
tparams.duplex = FULL_DUPLEX;
pagesize = getpagesize();
if (pagesize < BUFSIZE)
pagesize = BUFSIZE;
tparams.pagesize = pagesize;
if (tparams.pagesize < (WINDOW_ATTR_SZ * 3 * tparams.nowins))
tparams.pagesize = (WINDOW_ATTR_SZ * 3 * tparams.nowins);
tparams.pagesize += WINDOW_MGMT_SZ;
tparams.pagesize += (pagesize - (tparams.pagesize % pagesize));
tparams.mmap_sz = (tparams.winsz * tparams.nowins *
(tparams.duplex+1)) + tparams.pagesize;
fptr = (int (*)())dlsym(RTLD_NEXT, "bind");
if (fptr == NULL) {
DEBUG(fprintf(stderr, "dlsym: %s\n", dlerror()));
return -1;
}
sema_init(&accept_p_s, 1, USYNC_THREAD, 0);
sema_init(&accept_r_s, 0, USYNC_THREAD, 0);
closed_door_q.max_elems = MAX_FDS;
closed_door_q.first_elem = 0;
closed_door_q.last_elem = 0;
closed_door_q.no_elems = 0;
}
/* 3: chain to the libsocket.so bind */
return ((*fptr)(s, addr, addrlen));
}
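The sizing arithmetic in the bind function can be restated as a small, self-contained calculation. This is a simplified sketch, not the library's code: the layout_params type and its field names are hypothetical, two directions per connection are assumed, and the bookkeeping header is rounded up to a whole number of pages with a conventional round-up rather than the exact expression used in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one connection's shared segment. */
typedef struct {
    size_t nowins;     /* number of windows per direction            */
    size_t winsz;      /* payload bytes per window                   */
    size_t attr_sz;    /* bookkeeping bytes per window (assumed)     */
    size_t mgmt_sz;    /* bookkeeping bytes per segment (assumed)    */
    size_t pagesize;   /* system page size, e.g. from getpagesize()  */
} layout_params;

/* Round n up to the next multiple of page. */
static size_t round_up(size_t n, size_t page)
{
    assert(page > 0);
    return (n + page - 1) / page * page;
}

/* Header: window attributes for both directions plus the management
 * area, padded so that the payload starts on a page boundary. */
size_t header_bytes(const layout_params *p)
{
    return round_up(p->attr_sz * 2 * p->nowins + p->mgmt_sz, p->pagesize);
}

/* Whole segment: header followed by the read and write payload areas. */
size_t segment_bytes(const layout_params *p)
{
    return header_bytes(p) + p->winsz * p->nowins * 2;
}
```

With 4 windows of 2048 bytes, 32 bytes of attributes per window, a 64-byte management area, and 4096-byte pages, the header occupies one page and the segment is 4096 + 16384 = 20480 bytes.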
b. accept
The accept function is now simulated for loopback connections.
A producer/consumer paradigm is again employed. The door_service
function is the producer of the connections, and the accept function
is the consumer of these connections. The door_service function
produces connections on requests from loopback clients.
The accept function now waits on the semaphore accept_r_s for
a client connection. When a client tries to establish a loopback
connection, a fast context switch is made using doors IPC into the
door_service on the server. The memory mapped structures are
created if it is a new connection, and a sema_post is executed on
accept_r_s by the door_service thread. This wakes up the accept
thread, and a successful connection is created. The TCP ephemeral
port is also simulated.
Code Example 8. Server door_service Function in Implementation II
void door_service(void *cookie, char *argp, size_t arg_size,
door_desc_t*dp,uint_t n_descriptors)
{
...
} else if (ptr->type == CONNECT) {
/* Step 1 */
client_doorinfo *ptr = (client_doorinfo*)argp;
size = ptr->size;
mutex_lock(&connect_m); /* At the moment connect requests */
/* are serialized, slowing down this segment */
while(sema_wait(&accept_p_s));
/* Step 2 */
connect_port = -1;
if (ptr->port > 0)
connect_port = ptr->port;
accept_fd = socket(AF_INET, SOCK_STREAM, 0);
if (connect_port == -1) {
connect_port = port_avail;
port_avail++;
port_avail %= szshort;
}
/* Step 3 */
accept_fd = door_accept(accept_fd, &client,
sizeof(client), 1);
if (accept_fd == -1) {
ptr->port = -1;
sema_post(&accept_p_s);
mutex_unlock(&connect_m);
door_return((char*)ptr, size, NULL, 0);
}
/* Step 4 */
ptr->port = client.sin_port;
pmap[fd].state = INUSE;
doconnect(accept_fd, (client_doorinfo*)ptr);
accept_count++;
sema_post(&accept_r_s);
mutex_unlock(&connect_m);
door_return((char*)ptr, size, NULL, 0);
Step 1: Connections are currently serialized. A sema_wait checks whether accept is free to create a connection.
Step 2: connect_port is -1 for a new connection and holds a value if the connection is pooled.
Step 3: Create the memory mapped segment and the data structures needed for the connection.
Step 4: If the connection is successful, return the connection information to the client.
int accept(int s, struct sockaddr *addr, Psocklen_t addrlen)
{
...
/* Step 1 */
for (;;) {
if (sema_wait(&accept_r_s)) {
for (j=0; j<100; j++);
} else
break;
}
/* Step 2 */
accept_count--;
client = (struct sockaddr_in *)addr;
client->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
client->sin_family = AF_INET;
client->sin_port = htons(connect_port);
fildes = accept_fd;
sema_post(&accept_p_s);
return fildes;
...
Step 1: Connections are currently serialized. A sema_wait waits for a client connection request; see door_service above.
Step 2: Simulate the TCP connection data and execute a sema_post to signal door_service that the connection succeeded.
c. read
The server read functions as in Code Example 2 and is a
consumer of the client-write data. Since a sliding window
type of protocol is used, some calculation is required to
find the correct window and the correct slot in the window.
The read function waits on the rd_occupied semaphore. When
the client writes data, it is copied into a memory mapped slot,
and a fast context switch is made into the door_service on the
server. The door service does a sema_wait on the rd_empty
semaphore and, if successful, executes a sema_post operation
on the rd_occupied semaphore. The sema_post wakes up the read thread, and the read thread copies the data using bcopy
and executes a sema_post on the rd_empty semaphore.
Code Example 9. Server read Function in Implementation II
ssize_t read(int fd, void *buf, size_t nbyte)
{
...
if (fd > 0 && pmap[fd].fd == fd) {
/* 1: wait for data unless a partially read window is pending */
w_mgmt_ptr = pmap[fd].r_w_mgmt_ptr;
if (pmap[fd].partial_read_flag == 0) {
rd_occupied--;
while(sema_wait(&pmap[fd].rd_occupied));
}
/* 2: locate the active window and the first unread byte in it */
win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
w_attr_ptr = (int*)(pmap[fd].r_w_attr_ptr_offset +
WINDOW_INDEX(win));
mptr = pmap[fd].r_mptr;
w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
w_dptr = w_dptr + w_attr_ptr[START_ADDR];
/* 3: copy out at most the bytes remaining in the window */
if (nbyte <= w_attr_ptr[CSZ]) {
bcopy(w_dptr, buf, nbyte);
w_attr_ptr[START_ADDR] = nbyte ;
w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte ;
} else if (nbyte > w_attr_ptr[CSZ]) {
bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
nbyte = w_attr_ptr[CSZ];
w_attr_ptr[CSZ] = 0;
}
/* 4: window drained: advance to the next window and release this one */
if (w_attr_ptr[CSZ] == 0) {
w_attr_ptr[START_ADDR] = 0 ;
w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
w_mgmt_ptr[SERVER_ACTIVE_WIN] =
w_mgmt_ptr[SERVER_ACTIVE_WIN]
% tparams.nowins;
rd_empty++;
pmap[fd].partial_read_flag = 0;
sema_post(&pmap[fd].rd_empty);
}else {
pmap[fd].partial_read_flag = 1;
}
...
}
void door_service(void *cookie, char *argp,
size_t arg_size, door_desc_t*dp, uint_t n_descriptors)
{
...
} else if (ptr->type == WRITE ) {
...
/* 1: wait for a free window, advance the client window index, wake the reader */
while(sema_wait(&pmap[fd].rd_empty));
mptr = (int*)pmap[fd].mdoor.mptr;
w_mgmt_ptr = mptr + WINDOW_MGMT_BEGIN;
w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
w_mgmt_ptr[CLIENT_ACTIVE_WIN] % tparams.nowins;
sema_post(&pmap[fd].rd_occupied);
...
}
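The window bookkeeping in the read function can be isolated from the semaphore handling. The sketch below is an illustration under stated assumptions: window_t and reader_t are hypothetical types standing in for the w_attr_ptr and w_mgmt_ptr arrays, memcpy replaces bcopy, the read offset accumulates across partial reads, and the waits on rd_occupied and posts on rd_empty are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-window state (START_ADDR and CSZ in the article). */
typedef struct {
    size_t      start;   /* offset of the first unread byte */
    size_t      csz;     /* unread bytes remaining          */
    const char *data;    /* window payload                  */
} window_t;

/* Hypothetical per-connection read state. */
typedef struct {
    window_t *wins;      /* array of windows                      */
    size_t    nowins;    /* number of windows                     */
    size_t    active;    /* index of the window being read        */
    int       partial;   /* non-zero if the window has data left  */
} reader_t;

/* Copy at most nbyte bytes from the active window and return the
 * number copied. When the window is drained, move to the next one. */
size_t window_read(reader_t *r, void *buf, size_t nbyte)
{
    window_t *w;
    size_t n;

    assert(r->nowins > 0);
    w = &r->wins[r->active];
    n = nbyte < w->csz ? nbyte : w->csz;

    memcpy(buf, w->data + w->start, n);
    w->start += n;
    w->csz   -= n;

    if (w->csz == 0) {
        w->start   = 0;
        r->active  = (r->active + 1) % r->nowins;
        r->partial = 0;   /* the next read must wait for new data */
    } else {
        r->partial = 1;   /* the next read continues in this window */
    }
    return n;
}
```

A caller that asks for fewer bytes than the window holds gets exactly that many and the partial flag stays set; a caller that asks for more gets only what the window holds, which matches the short-read behavior applications already expect from sockets.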
d. write
The write also functions as in implementation I and
is a producer of client-read data. The write function
waits on a wr_empty semaphore and, if successful, bcopies
data into a memory mapped slot. It executes a sema_post
on the wr_occupied semaphore to wake up the door_service
thread. When the client tries to read some data, a fast
context switch is made into the door_service on the server,
and a sema_wait is executed on the wr_occupied semaphore.
If successful, a sema_post is executed on the wr_empty
semaphore.
Code Example 10. Server write Function in Implementation II
ssize_t write(int fd, const void *buf, size_t nbyte)
{
...
if (fd > 0 && pmap[fd].fd == fd) {
/* 1: wait for an empty window before each chunk */
cbuf = (void*)buf;
csz = nbyte;
w_mgmt_ptr = pmap[fd].w_w_mgmt_ptr;
mptr = pmap[fd].w_mptr;
while(csz > 0) {
wr_empty--;
sema_ptr = (sema_t*)&pmap[fd].wr_empty;
while(sema_wait(&pmap[fd].wr_empty));
/* 2: copy up to one window's worth of data, then signal door_service */
win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
w_attr_ptr = (int*) (pmap[fd].w_w_attr_ptr_offset +
WINDOW_INDEX(win));
w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
if (csz <= w_attr_ptr[SZ]) {
bcopy(cbuf, w_dptr, csz);
w_attr_ptr[CSZ] = csz;
cbuf = ((char*)cbuf) + csz;
csz = 0;
} else if (csz > w_attr_ptr[SZ]) {
bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
w_attr_ptr[CSZ] = w_attr_ptr[SZ];
csz = csz - w_attr_ptr[SZ];
cbuf = ((char*)cbuf) + w_attr_ptr[SZ];
}
w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
w_mgmt_ptr[CLIENT_ACTIVE_WIN] %
tparams.nowins;
wr_occupied++;
sema_ptr = (sema_t*)&pmap[fd].wr_occupied;
sema_post(&pmap[fd].wr_occupied);
}
...
}
void door_service(void *cookie, char *argp, size_t arg_size,
door_desc_t*dp, uint_t n_descriptors)
{
if (ptr->type == READ) {
while(sema_wait(&pmap[fd].wr_occupied));
...
sema_post(&pmap[fd].wr_empty);
door_return((char*)&ptr->ret, sizeof(int), NULL, 0);
}
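The chunking loop in the write function can be shown on its own. This is a sketch under stated assumptions rather than the library's code: wslot_t and ring_write are hypothetical names, memcpy replaces bcopy, and the wait on wr_empty before each chunk and the post on wr_occupied after it are left out, so nothing here prevents unread data from being overwritten.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical window state (SZ and CSZ in the article). */
typedef struct {
    char  *data;   /* window payload area          */
    size_t sz;     /* capacity of the window       */
    size_t csz;    /* bytes currently stored in it */
} wslot_t;

/* Split nbyte bytes from buf into window-sized chunks, filling the
 * windows in ring order starting at *active. Returns nbyte. */
size_t ring_write(wslot_t *slots, size_t nslots, size_t *active,
                  const void *buf, size_t nbyte)
{
    const char *src  = buf;
    size_t      left = nbyte;

    assert(nslots > 0);
    while (left > 0) {
        wslot_t *s = &slots[*active];
        size_t   n;

        assert(s->sz > 0);
        n = left < s->sz ? left : s->sz;
        memcpy(s->data, src, n);
        s->csz = n;                        /* record how much this window holds */
        src   += n;
        left  -= n;
        *active = (*active + 1) % nslots;  /* advance to the next window */
    }
    return nbyte;
}
```

Writing 7 bytes into two 4-byte windows fills the first window completely, puts the remaining 3 bytes in the second, and wraps the active index back to the first window.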
Client Side Functions
a. connect
The connect function does a fast context switch to
the door_service to set up a connection with the server.
See the example of the server side accept function in
Code Example 8. On the return from the door service, the
shared memory mapped segment is mapped into the client address
space. The connect function caches client connections, and if
the client reconnects, it sends the cached descriptor to
the server to reestablish the connection.
b. read
The read function is similar to the server read function,
except it is on the client-side. The read function does a fast
context switch to the door_service on the server and waits for
server-write data. On return from the door call, the data is
bcopied to the client buffer from the memory mapped slot.
Code Example 11. Client read Function in Implementation II
ssize_t read(int fildes, void *buf, size_t nbyte)
{
...
if (fildes > 0 && dinfo[fildes].fd == fildes) {
if (dinfo[fildes].partial_read_flag == 0) {
...
/* 1: door call into the server; it blocks there until data is available */
dinfo[fildes].rinfo.size=nbyte;
dinfo[fildes].rinfo.type=READ;
dinfo[fildes].rinfo.port = dinfo[fildes].port;
darg.data_ptr = (char *)&dinfo[fildes].rinfo;
darg.data_size = sizeof(readinfo);
darg.desc_ptr = NULL;
darg.desc_num = 0;
darg.rbuf = (char*)&dinfo[fildes].rinfo.ret;
darg.rsize = sizeof(int);
/* semaphore block on occupied will happen
in the door server */
door_call(door_fd, &darg);
if (dinfo[fildes].rinfo.ret == -1) {
dinfo[fildes].state = CLOSE;
return 0;
}
if (dinfo[fildes].rinfo.ret > 0) {
dinfo[fildes].rinfo.nowins =
dinfo[fildes].rinfo.ret;
dinfo[fildes].rinfo.nowins--;
dinfo[fildes].state = IN_CLOSE;
}
}
}
/* 2: copy the data out of the memory mapped window */
mptr = dinfo[fildes].r_mptr;
win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
w_attr_ptr = (int*)(dinfo[fildes].r_w_attr_ptr_offset +
WINDOW_INDEX(win));
w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
w_dptr = w_dptr + w_attr_ptr[START_ADDR];
if (nbyte <= w_attr_ptr[CSZ]) {
bcopy(w_dptr, buf, nbyte);
w_attr_ptr[START_ADDR] = nbyte ;
w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte ;
} else if (nbyte > w_attr_ptr[CSZ]) {
bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
nbyte = w_attr_ptr[CSZ];
w_attr_ptr[CSZ] = 0;
}
if (w_attr_ptr[CSZ] == 0) {
w_attr_ptr[START_ADDR] = 0 ;
w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
w_mgmt_ptr[SERVER_ACTIVE_WIN] =
w_mgmt_ptr[SERVER_ACTIVE_WIN]
% tparams.nowins;
dinfo[fildes].partial_read_flag = 0;
}else {
dinfo[fildes].partial_read_flag = 1;
}
return nbyte;
}
}
c. write
The client write is similar to the server-side write function. Client data is bcopied to a memory mapped slot, a fast context switch is executed to enter the door_service function on the server, and the waiting server read thread is woken up.
Code Example 12. Client write Function in Implementation II
ssize_t write(int fildes, const void *buf, size_t nbyte)
{
if (fildes > 0 && dinfo[fildes].fd == fildes) {
cbuf = (void*)buf;
csz = nbyte;
w_mgmt_ptr = dinfo[fildes].w_w_mgmt_ptr;
mptr = dinfo[fildes].w_mptr;
while(csz > 0) {
win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
w_attr_ptr = (int*) (dinfo[fildes].w_w_attr_ptr_offset +
WINDOW_INDEX(win));
w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
if (csz <= w_attr_ptr[SZ]) {
bcopy(cbuf, w_dptr, csz);
w_attr_ptr[CSZ] = csz;
cbuf = ((char*)cbuf) + csz;
csz = 0;
} else if (csz > w_attr_ptr[SZ]) {
bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
w_attr_ptr[CSZ] = w_attr_ptr[SZ];
csz = csz - w_attr_ptr[SZ];
cbuf = ((char*)cbuf) + w_attr_ptr[SZ];
}
dinfo[fildes].winfo.size=nbyte;
dinfo[fildes].winfo.type=WRITE;
dinfo[fildes].winfo.port= dinfo[fildes].port;
darg.data_ptr = (char *)&dinfo[fildes].winfo;
darg.data_size = sizeof(writeinfo);
darg.desc_ptr = NULL;
darg.desc_num = 0;
darg.rbuf = NULL;
darg.rsize = 0;
door_call(door_fd, &darg);
}
return nbyte;
}
...
}
Configuration Environment Variables
As mentioned earlier the memory mapped segment is
divided into windows, and each window is divided into
slots. The number of windows is not configurable at this
time, but the number and size of slots
are configurable through environment variables. Currently,
the number of slots is limited to 152. This could be
increased for better performance.
Environment Variable    Value Range
SPEED_NOWINS            2-152
SPEED_WINSIZE           256-8192
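One way to read such settings is sketched below. The helper name env_int and the default values passed by the caller are assumptions for illustration; the ranges come from the table above. Unlike the bind code in Code Example 7, which only rejects values less than or equal to zero, this sketch falls back to the default for any value outside the documented range.

```c
/* Exposes setenv/unsetenv, which are handy for trying the helper out. */
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Return the integer value of environment variable name, or dflt if the
 * variable is unset, empty, not a number, or outside [lo, hi]. */
int env_int(const char *name, int dflt, int lo, int hi)
{
    const char *s = getenv(name);
    char *end;
    long v;

    assert(lo <= hi);
    if (s == NULL || *s == '\0')
        return dflt;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < lo || v > hi)
        return dflt;
    return (int)v;
}
```

For example, with SPEED_NOWINS=16 in the environment, env_int("SPEED_NOWINS", 4, 2, 152) returns 16; with the variable unset or set to 0 it returns the caller's default, here 4.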
Speed Implementation with Memory Map Only - Implementation III
This implementation is similar to implementation II as discussed
above, but doors IPC is not used for context switching. Instead,
system-scope semaphores are created in the shared mmapped space,
and a sema_post is executed on these semaphores to signal data
availability. The bind, accept and connect functions have similar functionality.
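A semaphore can only synchronize two processes if it lives in memory that both can see and is initialized for cross-process use. The following is a minimal sketch using POSIX calls, not the Speed library's code: sem_init with a non-zero pshared argument plays the role that sema_init with USYNC_PROCESS plays on Solaris, an anonymous shared mapping (inherited across fork) is assumed in place of the library's memory mapped segment, and the shared_sems type is hypothetical.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */

#include <assert.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical pair of semaphores placed at the start of a shared segment. */
typedef struct {
    sem_t occupied;   /* posted by the writer when data is available */
    sem_t empty;      /* posted by the reader when space is free     */
} shared_sems;

/* Map a shared region and initialize both semaphores for use across
 * processes. Returns NULL on failure. */
shared_sems *shared_sems_create(void)
{
    shared_sems *p = mmap(NULL, sizeof *p, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Second argument 1: the semaphore is shared between processes. */
    if (sem_init(&p->occupied, 1, 0) != 0 ||
        sem_init(&p->empty, 1, 1) != 0) {
        munmap(p, sizeof *p);
        return NULL;
    }
    return p;
}

/* Destroy the semaphores and unmap the region. */
void shared_sems_destroy(shared_sems *p)
{
    assert(p != NULL);
    sem_destroy(&p->occupied);
    sem_destroy(&p->empty);
    munmap(p, sizeof *p);
}
```

After a fork, parent and child both see the same two semaphores, so a sem_post in one process wakes a sem_wait in the other without a door call.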
Server Side Functions
a. read
The read function is similar to the read function
in Code Example 9. The server read thread
waits on a system-scope w_mgmt_ptr[SEMA_R_O] semaphore.
The client writes data directly into the mmapped slot,
and executes a sema_post on w_mgmt_ptr[SEMA_R_O] semaphore
to wake up the server read thread. The read thread executes
a bcopy to transfer the data from the memory mapped slot
into the server read buffer.
Code Example 13. Server read Function in Implementation III
ssize_t read(int fd, void *buf, size_t nbyte)
{
...
if (fd > 0 && pmap[fd].fd == fd) {
/* 1: wait on the shared semaphore unless a partial read is pending */
w_mgmt_ptr = pmap[fd].r_w_mgmt_ptr;
if (pmap[fd].partial_read_flag == 0) {
sema_ptr = (sema_t*)&w_mgmt_ptr[SEMA_R_O];
while(sema_wait(sema_ptr));
}
/* 2: locate the active window and copy the data out */
win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
w_attr_ptr = (int*)(pmap[fd].r_w_attr_ptr_offset +
WINDOW_INDEX(win));
mptr = pmap[fd].r_mptr;
w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
w_dptr = w_dptr + w_attr_ptr[START_ADDR];
if (nbyte <= w_attr_ptr[CSZ]) {
bcopy(w_dptr, buf, nbyte);
w_attr_ptr[START_ADDR] = nbyte ;
w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte ;
} else if (nbyte > w_attr_ptr[CSZ]) {
bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
nbyte = w_attr_ptr[CSZ];
w_attr_ptr[CSZ] = 0;
}
/* 3: window drained: advance and post the shared empty semaphore */
if (w_attr_ptr[CSZ] == 0) {
w_attr_ptr[START_ADDR] = 0 ;
w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
w_mgmt_ptr[SERVER_ACTIVE_WIN] =
w_mgmt_ptr[SERVER_ACTIVE_WIN]
% tparams.nowins;
rd_empty++;
pmap[fd].partial_read_flag = 0;
sema_ptr = (sema_t*)&w_mgmt_ptr[SEMA_R_E];
sema_post(sema_ptr);
}else {
pmap[fd].partial_read_flag = 1;
}
return nbyte;
}
return ((*fptr)(fd, buf, nbyte));
}
b. write
The write function is similar to the write function
described in Implementation II. The write thread waits
on the system-scope w_mgmt_ptr[SEMA_W_E] semaphore. Once the
wait succeeds, data is bcopied into the memory mapped slot,
and a sema_post is executed on the w_mgmt_ptr[SEMA_W_O]
semaphore to wake up the client read thread.
Code Example 14. Server write Function in Implementation III
ssize_t write(int fd, const void *buf, size_t nbyte)
{
    ...
    if (fd > 0 && pmap[fd].fd == fd) {
        cbuf = (void *)buf;
        csz = nbyte;
        w_mgmt_ptr = pmap[fd].w_w_mgmt_ptr;
        mptr = pmap[fd].w_mptr;
        /* Copy the caller's buffer one slot at a time. */
        while (csz > 0) {
            /* Wait for an empty slot. */
            sema_ptr = (sema_t *)&w_mgmt_ptr[SEMA_W_E];
            while (sema_wait(sema_ptr));
            win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
            w_attr_ptr = (int *)(pmap[fd].w_w_attr_ptr_offset +
                                 WINDOW_INDEX(win));
            w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
            if (csz <= w_attr_ptr[SZ]) {
                /* The remaining data fits in this slot. */
                bcopy(cbuf, w_dptr, csz);
                w_attr_ptr[CSZ] = csz;
                cbuf = ((char *)cbuf) + csz;
                csz = 0;
            } else if (csz > w_attr_ptr[SZ]) {
                /* Fill the slot; the rest goes into the next one. */
                bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
                w_attr_ptr[CSZ] = w_attr_ptr[SZ];
                csz = csz - w_attr_ptr[SZ];
                cbuf = ((char *)cbuf) + w_attr_ptr[SZ];
            }
            /* Advance to the next window, wrapping around. */
            w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
            w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
                w_mgmt_ptr[CLIENT_ACTIVE_WIN] % tparams.nowins;
            /* Signal the reader that a slot now holds data. */
            sema_ptr = (sema_t *)&w_mgmt_ptr[SEMA_W_O];
            sema_post(sema_ptr);
        }
        ...
}
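The slot bookkeeping in Code Examples 13 and 14 can be modeled in a single process without any semaphores. The sketch below is a simplified stand-in, not the library's data layout: the struct names, the ring size NSLOTS, and the deliberately tiny SLOT_SZ are assumptions chosen so that a short message is split across slots and read back in pieces.

```c
#include <stddef.h>
#include <string.h>

#define NSLOTS   4   /* assumption: small ring for illustration */
#define SLOT_SZ  8   /* assumption: tiny slots to force chunking */

/* Hypothetical model of a slot: csz is the number of unread bytes,
 * start is the offset of the first unread byte. */
struct slot {
    size_t csz;
    size_t start;
    char   data[SLOT_SZ];
};

struct ring {
    struct slot slots[NSLOTS];
    size_t write_win;   /* next slot to fill  */
    size_t read_win;    /* next slot to drain */
};

/* Split buf across consecutive empty slots; returns bytes written.
 * Stops early when the next slot is still occupied, where the real
 * library would block on a semaphore instead. */
static size_t ring_write(struct ring *r, const void *buf, size_t nbyte)
{
    const char *p = buf;
    size_t left = nbyte;

    while (left > 0 && r->slots[r->write_win].csz == 0) {
        struct slot *s = &r->slots[r->write_win];
        size_t n = left < SLOT_SZ ? left : SLOT_SZ;

        memcpy(s->data, p, n);
        s->csz = n;
        s->start = 0;
        p += n;
        left -= n;
        r->write_win = (r->write_win + 1) % NSLOTS;
    }
    return nbyte - left;
}

/* Read at most nbyte from the current slot only. A short read leaves
 * the remainder in place for the next call (the partial read case). */
static size_t ring_read(struct ring *r, void *buf, size_t nbyte)
{
    struct slot *s = &r->slots[r->read_win];
    size_t n;

    if (s->csz == 0)
        return 0;                       /* nothing to read */
    n = nbyte < s->csz ? nbyte : s->csz;
    memcpy(buf, s->data + s->start, n);
    s->start += n;
    s->csz -= n;
    if (s->csz == 0) {                  /* slot drained: release it */
        s->start = 0;
        r->read_win = (r->read_win + 1) % NSLOTS;
    }
    return n;
}
```

Writing ten bytes into a zeroed ring fills the first slot with eight bytes and the second with two; successive five-byte reads then return five, three, and two bytes, because a read never crosses a slot boundary.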
Client Side
The read and write functions on the client
side are similar to the read and write functions
on the server side, as shown above.
Configuration Environment Variables
Just as in implementation II, the number and size of
slots are configurable through
environment variables. Currently, the number of slots
is limited to 152. This could be increased for better
performance.
Environment Variable    Value Range
SPEED_NOWINS            2-152
SPEED_WINSIZE           256-8192
Performance Comparisons
To measure and compare the performance of the various
Speed implementations, several tests were carried out
using a publicly available multithreaded client-server
program from "Multithreaded Programming with PThreads" [6].
Some simple modifications were made to the client.c and
server_ms.c code. The socket file descriptor was configured
with fcntl(NDELAY), fcntl(NOBLOCK), and TCP_NODELAY (that is,
non-blocking I/O with Nagle's algorithm disabled) to ensure
fast response times. TNF instrumentation was added to get
more accurate read and write latencies. The client-server
application was first run without the Speed library
interposed, in other words, with libsocket.so only, and the
measurements were recorded. The application was then run
with each implementation of the Speed library interposed,
and the measurements were again recorded. The results are
discussed in detail in the following sections.
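The socket settings described above can be sketched as follows. This is an assumption about what the modified test programs did, not a copy of them; it uses the standard O_NONBLOCK flag and the TCP_NODELAY socket option, and the function name make_low_latency is hypothetical.

```c
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Put a TCP socket into non-blocking mode and disable Nagle's
 * algorithm. Returns 0 on success, -1 on error (errno is set). */
static int make_low_latency(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    int on = 1;

    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

With TCP_NODELAY set, small messages such as the 70-byte case are sent immediately rather than being held back for coalescing, which matters when measuring per-message latency.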
Latency Measurements Using TNF Instrumentation
To obtain the latency of the read and write functions,
the client and server were executed separately with ktrace
and prex. The measurements were first recorded without the
Speed library, with libsocket.so only. The Speed library
implementations were then interposed and the latencies were
measured again. The hardware used was an E450 with four
400 MHz CPUs and two gigabytes of memory, running Solaris 8
(with no updates) in 32-bit mode.
The client and server were first run with all CPUs enabled and
with no processor set, and the times were captured for
different message sizes. Then a processor set was created
with two CPUs, with interrupts disabled in the set. The
server was started, and all of its LWPs were bound to this
processor set by using psrset -b [set] serverpid. The TNF
latency data was tabulated and is shown in the tables below.
Average Latency to Send and Receive a Message of 70 Bytes
Latency Measurements on the Server Side
Without Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.125677456289998   0.0297473708562897  100001
With Speed Implementation II interposed   0.0110809523404773  0.0203847802021977  100001
With Speed Implementation III interposed  0.0205440758792417  0.0206150824791755  100001
With Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.124444387940001   0.0280241496185034  100001
With Speed Implementation II interposed   0.0526672733072647  0.0147964380856189  100001
With Speed Implementation III interposed  0.0208632908170919  0.02032837898621    100001
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:
SPEED_WINSIZE=1024
SPEED_NOWINS=152
Latency Measurements on the Client Side
Without Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.125677456289998   0.157995061719998   100001
With Speed Implementation II interposed   0.0621857830399993  0.0559983035969648  100001
With Speed Implementation III interposed  0.0587933812261891  0.0535122446399991  100001
With Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.0893453558800021  0.0763784739052616  100001
With Speed Implementation II interposed   0.112057333939998   0.106013310426899   100001
With Speed Implementation III interposed  0.0586836595699997  0.0521766302836968  100001
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
Average Latency to Send and Receive a Message of 512 Bytes
Server Side
Without Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.0402189073300003  0.0491265955040463  100001
With Speed Implementation II interposed   0.0194886660533382  0.0133252572474274  100001
With Speed Implementation III interposed  0.0211982528274719  0.0224482145878531  100001
With Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.0396790602099998  0.039741772542272   100001
With Speed Implementation II interposed   0.0661524060259426  0.01320928669713    100001
With Speed Implementation III interposed  0.022598017399827   0.0236643345866558  100001
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:
SPEED_WINSIZE=1024
SPEED_NOWINS=152
Client Side
Without Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.112412333176      0.10398922812       100001
With Speed Implementation II interposed   0.0671600455099987  0.0595745007849909  100001
With Speed Implementation III interposed  0.0621186681633193  0.0562640110500007  100001
With Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.0941984248899985  0.085334515084848   100001
With Speed Implementation II interposed   0.112832879440001   0.106662469745304   100001
With Speed Implementation III interposed  0.0586836595699997  0.0521766302836968  100001
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
Average Latency to Send and Receive a Message of 1000 Bytes
Server Side
Without Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.0397869155800024  0.0670270991090072  100001
With Speed Implementation II interposed   0.0167015795242036  0.012646576994229   100001
With Speed Implementation III interposed  0.0240916335036657  0.024710518264817   100001
With Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.0408155563300022  0.0491813939560609  100001
With Speed Implementation II interposed   0.0480595041849585  0.0201019823701757  100001
With Speed Implementation III interposed  0.0240916335036657  0.024710518264817   100001
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:
SPEED_WINSIZE=1024
SPEED_NOWINS=152
Client Side
Without Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.110900986299999   0.101823674323258   100001
With Speed Implementation II interposed   0.0627418030919698  0.0681796264200012  100001
With Speed Implementation III interposed  0.0661602952899995  0.0600366353736469  100001
With Processor Set
Configuration                             Read (secs)         Write (secs)        Messages
Client-server only                        0.102592658859999   0.0929193959160412  100001
With Speed Implementation II interposed   0.0681796264200012  0.0627418030919698  100001
With Speed Implementation III interposed  0.0678388086000003  0.0616162311676886  100001
Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
Performance Measurements
The Speed library was recompiled without the TNF
instrumentation, and the performance of the client and
server was measured again, first without interposing the
Speed library (in other words, with only libsocket.so) and
then with each of the Speed libraries interposed. The
timings were recorded with the iobench routine, which is
part of the client program. This routine uses the proc file
system (/proc) to get very accurate measurements.
The client-server application was run with different CPU
configurations and in a processor set to measure the optimum
performance. The results were then tabulated. Only the best
configurations are shown below.
Table Key:
I   - Client-server only, no Speed library; server run in a processor set of two CPUs, interrupts disabled; client run on the remaining two CPUs
II  - Speed Implementation I interposed, no processor set, four CPUs
III - Speed Implementation II interposed, no processor set, four CPUs
IV  - Speed Implementation III interposed, no processor set, two CPUs
Send and Receive 100,000 Messages of 70 Bytes
Run Information
Time                          I             II            III           IV
Elapsed                       16.1879       6.25777       5.99889       4.89628
Total                         96.89         37.5451       35.9921       14.6787
CPU                           4.6699        5.60625       6.1262        4.57105
User                          1.75525       2.09086       2.55358       2.8791
System                        2.91465       3.51539       3.57261       1.69194
Trap                          0.000489566   0.000106294   9.4778e-05    0.000822242
Wait                          0.225347      0.00119709    0.000498026   4.08094
Process                       -1.61141e+06  -1.61206e+06  -1.61111e+06  -1.64029e+06
Stopped                       1.736e-05     1.324e-05     1.67e-05      2.6115e-05
Voluntary context switches    8             7             13            1
Involuntary context switches  0             0             1             2
CPU usage                     28.9%         89.6%         102.2%        93.4%
Send and Receive 100,000 Messages of 512 Bytes
Run Information
Time                          I             II            III           IV
Elapsed                       9.77146       8.04453       6.8338        5.13338
Total                         58.6191       48.2656       41.0015       15.3894
CPU                           7.49872       7.46839       7.20227       4.83241
User                          1.84276       2.66459       3.66689       3.3498
System                        5.65596       4.8038        3.53539       1.48261
Trap                          0.000857673   0.000159098   0.000193456   0.00167688
Wait                          0.620977      0.00156986    0.000336208   4.5697
Process                       -1.61144e+06  -1.61208e+06  -1.61113e+06  -1.64049e+06
Stopped                       1.839e-05     1.275e-05     1.2425e-05    2.2145e-05
Voluntary context switches    8             7             10            1
Involuntary context switches  0             0             1             2
CPU usage                     76.8%         92.9%         105.4%        94.2%
Send and Receive 100,000 Messages of 1000 Bytes
Run Information
Time                          I             II            III           IV
Elapsed                       10.8131       10.3504       6.56088       5.61761
Total                         64.783        62.1003       39.364        16.8415
CPU                           10.5051       8.53439       7.07174       5.3118
User                          2.5622        2.88587       3.48246       3.68684
System                        7.94289       5.64853       3.58928       1.62496
Trap                          0.000599836   0.000211443   0.000157957   0.00278208
Wait                          1.07922       0.00108808    0.000555176   5.13579
Process                       -1.61145e+06  -1.61209e+06  -1.61114e+06  -1.64038e+06
Stopped                       1.6255e-05    2.166e-05     1.63e-05      2.4335e-05
Voluntary context switches    7             7             10            1
Involuntary context switches  0             0             1             2
CPU usage                     97.2%         82.5%         107.8%        94.6%
bcopy Time
The time to bcopy 1000 bytes 100,000 times from a user
buffer to a memory mapped Speed buffer and vice versa, was
measured using the real-time function gethrtime. The
average time of the read and write operation is shown in
the table below. This was measured to estimate the actual
time spent copying data as opposed to the system-time
component, such as the time to context switch, time for sema
operations, and so forth. From this average time, the time
to bcopy 512 bytes and 70 bytes was deduced.
Time to bcopy
Iterations    1000 bytes (ms)    512 bytes (ms)    70 bytes (ms)
100000        490                250               34.3
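The deduction of the 512-byte and 70-byte figures from the 1000-byte measurement is consistent with simple proportional scaling, which can be sketched as below. The function name and the assumption that copy time is strictly proportional to message size are illustrative; the article does not state the exact method used.

```c
/* From the table above: 100,000 copies of 1000 bytes took 490 ms. */
#define REF_BYTES 1000.0
#define REF_MS    490.0

/* Estimate the time for 100,000 copies of `bytes` bytes, assuming the
 * cost is proportional to message size. This is an assumption: it
 * ignores per-call overhead and cache effects. */
static double estimated_bcopy_ms(double bytes)
{
    return REF_MS * bytes / REF_BYTES;
}
```

Under this assumption, 70 bytes gives 490 x 70 / 1000 = 34.3 ms, matching the table, and 512 bytes gives 250.88 ms, which the table appears to show rounded to 250.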
Volano 1.0 Performance
Running the Volano Mark 1.0 with the Speed library interposed
boosts performance by a factor of five. The Speed library does not
work with the newer Volano 2.X version, as the poll() call is
not yet supported by the library.
Observations
- Speed Implementation III seems to be the fastest. The best
time was obtained when run under a two CPU configuration.
While this configuration turned out to be the best for Speed
Implementation III, the others performed poorly with this
configuration, as they are heavier weight and need more
system resources to perform optimally.
- Speed Implementation I has considerable overhead as it
uses the kernel to dynamically allocate memory to copy the
client and server data and transfer it from one address
space to the other.
- Speed Implementation II uses mmap to copy the data to
overcome the previous limitation. This seems to perform
well, as the doors IPC is used only to make a fast context
switch and to return status.
- Since Speed Implementation III uses system semaphores to
signal the other process, this might be heavyweight and
might incur scheduling time, which could be alleviated by
running the server under a processor set.
- The client-server application without the Speed library interposed
seems to perform very badly when the message size is small,
but performance improves considerably with increased message
size.
- The Speed design uses slots to copy the application data.
This could result in fragmentation if an application tries
to read or write data sizes greater than the slot size.
Therefore, with data sizes larger than the slot size, the
TCP socket might perform better than the Speed library
implementation. This was not tested or measured.
- The current Speed implementation has a limitation of two
windows for read operations and three windows for write
operations. A better performance may result if this is
configurable and is increased.
- The number of slots is limited to 152. This limitation
arises out of using a limited memory, a page for buffer
management. Increasing the memory size removes the
limitation and may yield better numbers.
- The bcopy time increases with an increase in data size
and approaches user time. The time required to bcopy 1000
byte messages 100,000 times is almost 2 seconds because
there are four bcopy operations in a full duplex
operation, which is a client-write to the server and a
client-read from the server. This is about 35% of the
elapsed time in Speed Implementation III. Most of the
remaining 65% is a constant overhead of setting up the
memory mapped space, connections, and so forth.
- The server-side latency for Speed Implementation II
appears to be lower. However, this does not
include the time spent in the door_service routine.
Including this time should bring the time closer to the time
seen for Speed Implementation III.
- Since the design interposes on the TCP socket
library, client-server applications written in a variety of
languages, such as C, C++, or Java, should be able to use it
seamlessly.
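The bcopy share quoted in the observations above can be checked against the tables. The worked figures below take the 490 ms bcopy measurement for 1000-byte messages and the elapsed time in column IV (Speed Implementation III) of the 1000-byte run table; reading the percentage as a share of elapsed time is an assumption, chosen because it reproduces the stated value.

```latex
\[
t_{\text{bcopy}} \approx 4 \times 0.49\ \text{s} = 1.96\ \text{s},
\qquad
\frac{t_{\text{bcopy}}}{t_{\text{elapsed}}}
  = \frac{1.96\ \text{s}}{5.61761\ \text{s}} \approx 0.35
\]
```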
For Further Research
- With the Speed library interposed, the bcopy time
approaches user-time as message sizes increase. The
threshold when this happens needs to be measured.
- TCP sockets perform better with an increase in message
sizes. With a multithreaded client configuration (five
threads) and large message sizes (7000 bytes per read and
write), TCP socket performance approaches the performance
with the Speed library interposed. This is due to the system
being able to schedule the requests concurrently and squeeze
the wait-time. But at some threshold, the system will become
a bottleneck, as it may not be able to squeeze more
wait-time. This threshold needs to be studied.
- With the Speed library interposed, the application becomes
compute bound instead of being I/O bound. For sustained
sessions, the time-quanta available gets reduced with the
Solaris time-share class. This can be alleviated by
running with a modified priority or by modifying the
dispatch table. This needs to be studied further.
- Currently, the Speed library interposition works as an IPC
mechanism. For connections across the network, the regular
TCP/IP sockets are used. The same concept can be extended
and made to work across a network by interposing the TCP/IP
kernel module. With this extension, incoming and outgoing
messages can be transferred directly between the driver and
the application.
- Providing an API to copy data to and from the mmapped
space should cut bcopy times by half. This needs to be
explored.
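One way to picture the API proposed in the last item is to compare a copying write with an in-place write on a single slot. Everything in this sketch is hypothetical: the article proposes such an API but does not define one, so the names slot_write_copy, slot_acquire, and slot_commit, the struct layout, and the slot size are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <string.h>

#define SLOT_SZ 1024   /* assumption: same size as SPEED_WINSIZE=1024 */

/* Hypothetical stand-in for one memory mapped slot. */
struct slot {
    size_t csz;             /* bytes of valid data */
    char   data[SLOT_SZ];
};

/* Copying path, as the interposed write() does today: the message is
 * built in the caller's buffer and then copied into the slot. */
static size_t slot_write_copy(struct slot *s, const void *buf, size_t n)
{
    if (n > SLOT_SZ)
        n = SLOT_SZ;
    memcpy(s->data, buf, n);
    s->csz = n;
    return n;
}

/* Hypothetical direct path: hand the caller a pointer into the slot
 * so the message can be built in place, avoiding the extra copy. */
static char *slot_acquire(struct slot *s, size_t *cap)
{
    *cap = SLOT_SZ;
    return s->data;
}

/* Publish the number of bytes the caller placed in the slot. */
static void slot_commit(struct slot *s, size_t n)
{
    s->csz = n <= SLOT_SZ ? n : SLOT_SZ;
}
```

With the direct path on both the sending and receiving side, each direction would need one copy fewer, which is the halving of bcopy operations the item describes.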
Conclusion
Interposing the TCP socket library with the Speed library
boosts client-server performance by more than 100%. Speed
Implementation II and III offer significant benefits over
TCP/IP for interprocess communication. TCP/IP does not
perform well with small message sizes, but performs well
with an increase in message sizes. Speed Implementation III
outperforms the other implementations, including TCP/IP, on a
per-CPU basis. In fact, the user time approaches bcopy
time with an increase in message size. At the moment, four
bcopies are needed for a successful read and write
operation, as data needs to be copied from the application
buffer to the Speed memory mapped buffer and back. This
can be reduced by half if an API is exposed, allowing the
client and server applications to write directly to the
memory mapped space instead of using read and write
calls. This could offer a further boost to performance for
bigger message sizes.
Download
You can download the source and test data.
Acknowledgements
We would like to thank Bob Palowoda for his expert advice
and Ezhilan Narasimhan for his work on the DoorLet, which
initiated the idea for this project. We would also like to
thank Rupa Nagendra for helping with the tables and
formatting of this document.
References
[1] Spring Nucleus, A Microkernel for Objects, Graham Hamilton, Panos Kougiouris, 1993
[2] SpringOS Doors in Solaris, Jim Voll
[3] TCP Network Programming (Volumes 1 and 2), Richard Stevens
[4] Solaris Internals, Richard McDougall, Jim Mauro
[5] Inside Solaris Columns, www.sunworld.com, Jim Mauro
[6] Multithreaded Programming with Pthreads, Lewis and Berg. Programs used with permission.
[7] http://docs.sun.com