Fast Sockets, An Interprocess Communication Library
By Nagendra Nagarajayya and S.R. Venkataramanan, February 2001

Introduction

Interprocess communication (IPC) is the exchange of data between two or more processes on the same machine. There are various forms of IPC, such as sockets, memory mapped files, shared memory, pipes, message queues, and Solaris doors. All of these have the objective of moving data from one address space to another. The ideal way to move data would be to use bcopy to move it from one address space to another without trapping into the kernel. While this would be ideal, it is not possible.

An alternative is the Fast Sockets technology, which uses the Speed library. Fast Sockets is not a new form of IPC, but an implementation that uses the Solaris interposition technique to dynamically overlay INET-TCP sockets. The Speed library uses a combination of doors IPC and memory mapping to emulate TCP sockets. The Speed design is based on the principle that minimizing system time translates directly to a gain in application performance. The Speed library minimizes system time by using a shared memory map and using bcopy to read from and write to this shared space.

The fastest IPC on Solaris is doors [2], a newer API that was first available in Solaris 2.6. Applications that want to communicate using doors need to be explicitly programmed to do so. Even though doors IPC is very fast, socket-based IPC is more popular, since it is portable, flexible, and can be used to communicate across a network. Socket IPC on Solaris software is quite fast and makes use of the loopback interface to move data. But socket IPC has protocol overhead, the connection set up time is high, and it may be a little heavy-weight for a fast IPC. Sockets are implemented in the Solaris kernel, and applications using sockets transfer data using the system calls read and write. These calls make use of the kernel to move data by transferring it from the user space to the kernel, and from the kernel to the user space, thus incurring system time. Though this kernel dependency is necessary for applications communicating across a network, it impacts system performance when used for communication on the same machine. The Speed library uses bcopy and memory mapping to move data, incurring minimal system time and more user time. This reduced system time translates to a better elapsed time and frees the system for other tasks.

The Speed library was designed to overcome the issues of doors and socket-based IPC. The Speed library has no protocol overhead, the connection set up time is low, connections are pooled automatically, and read and write calls are converted into bcopy calls. bcopy is a user-level call, and kernel usage (system time) is limited to signaling data availability. Test data shows that elapsed time with the Speed library is in fact the sum of bcopy times plus a very small amount of system time, thus approaching the ideal way to move data between two processes.

Three different implementations of the Speed library are discussed in this article:

  • Using doors IPC to transfer data and signal the other process of data availability
  • Using memory maps to transfer data while using doors IPC to signal data availability
  • Using doors IPC to set up the initial connection and memory mapping to transfer data, while using semaphores to signal the other process of data availability

The performance of the three implementations is compared with each other and against TCP sockets. Using TNF instrumentation, the latency of a read and a write call on the client and the server was measured. These measurements are compared to explore the relationships between context switching times and performance.


Basic Concept: Interposition of Shared Objects

Solaris dynamic libraries allow a symbol to be interposed: if more than one definition of a symbol exists, the first definition found takes precedence over all the others. The environment variable LD_PRELOAD can be used to load shared objects before any other dependencies are loaded. The Speed library uses this concept to interpose the socket, accept, bind, connect, read, write, close, and thread_create symbols. See the interposed bind function in Code Example 1.

Speed interposition is needed on both the server and the client applications. This interposition allows existing client-server applications to transparently use the library. On the server side, LD_PRELOAD is used to load the shared library libspeedup_server.so, and on the client side LD_PRELOAD is used to load libspeedup_client.so.
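
The interposition mechanism can be sketched in a few lines of C. This is not the Speed library source, only a minimal illustration under assumptions: the counter interposed_writes is hypothetical, and the wrapper does nothing except count calls before chaining to the next definition of write found with dlsym(RTLD_NEXT, ...).

```c
#define _GNU_SOURCE             /* exposes RTLD_NEXT on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef RTLD_NEXT               /* fallback if the feature macro came too late */
#define RTLD_NEXT ((void *)-1L)
#endif

/* Hypothetical counter, present only so the effect can be observed. */
unsigned long interposed_writes = 0;

/* Same name and signature as the libc symbol, so this definition is found first. */
ssize_t write(int fd, const void *buf, size_t nbyte)
{
    static ssize_t (*next_write)(int, const void *, size_t) = NULL;

    if (next_write == NULL) {
        /* RTLD_NEXT: search the objects that come after this one. */
        next_write = (ssize_t (*)(int, const void *, size_t))
            dlsym(RTLD_NEXT, "write");
        if (next_write == NULL) {
            errno = ENOSYS;
            return -1;
        }
    }
    interposed_writes++;
    return next_write(fd, buf, nbyte);   /* chain to the original */
}
```

Built as a shared object and named in LD_PRELOAD, a wrapper of this shape is picked up by an unmodified application, which is the property the Speed library relies on. Some systems need the dl library on the link line for dlsym.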


Speed Implementation with Solaris Doors - Implementation I

This implementation of the Speed library is very simple and straightforward. On the client side, the connect, read, write, close, and thread_create symbols are interposed on the TCP socket symbols, while on the server side, the bind, accept, read, write, and close symbols are interposed. Even though the symbols are interposed, the TCP socket client-server semantics are not changed. The server establishes a server socket and listens on this socket. The client connects to this port to establish a connection, and starts reading and writing information as usual. But instead of flowing through the socket and the kernel, the data is transferred using the doors IPC.

Because threads in two different processes need to be synchronized to send and receive data, a producer/consumer paradigm is used to transfer data. In transferring data from the client to the server, a write operation by the client is a read operation in the server. In other words, the client becomes the producer and the server becomes the consumer. The roles are reversed when transferring data from the server to the client, in which case the server becomes the producer and the client becomes the consumer.
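
The producer/consumer handoff can be sketched with a single buffer guarded by two semaphores. The sketch below is an illustration, not the library code: it uses POSIX sem_t and memcpy in place of the Solaris sema_t and bcopy calls, and the names struct slot, slot_put, and slot_get are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 64            /* assumed buffer size */

/* Hypothetical one-buffer channel, modelled on an empty/occupied pair. */
struct slot {
    sem_t  empty;               /* 1 when the buffer may be overwritten */
    sem_t  occupied;            /* 1 when the buffer holds unread data */
    size_t len;
    char   buf[SLOT_SIZE];
};

int slot_init(struct slot *s)
{
    if (sem_init(&s->empty, 0, 1) != 0)
        return -1;
    return sem_init(&s->occupied, 0, 0);
}

/* Producer: wait for space, copy in, signal the consumer. */
size_t slot_put(struct slot *s, const void *data, size_t n)
{
    if (n > SLOT_SIZE)
        n = SLOT_SIZE;
    while (sem_wait(&s->empty) != 0 && errno == EINTR)
        ;                       /* retry if interrupted by a signal */
    memcpy(s->buf, data, n);
    s->len = n;
    sem_post(&s->occupied);
    return n;
}

/* Consumer: wait for data, copy out, signal the producer. */
size_t slot_get(struct slot *s, void *out, size_t n)
{
    while (sem_wait(&s->occupied) != 0 && errno == EINTR)
        ;
    if (n > s->len)
        n = s->len;
    memcpy(out, s->buf, n);
    sem_post(&s->empty);
    return n;
}
```

A client write maps to slot_put and the matching server read maps to slot_get; for the opposite direction a second slot is used with the roles swapped.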

The functionality is described in detail in the sections that follow.

Server Side Functions

a. bind

Once a server socket has been created, it is named with a call to the bind function. Since libspeedup_server.so is interposed on the server side, the Speed library bind function, shown in the following code example, is called first. The Speed library establishes the doors service and then calls the original socket bind.

Code Example 1. Server bind function in Implementation I

int bind(int s, const struct sockaddr *addr,
    socklen_t addrlen)
{

    int did;
    pid_t id;
    int dfd,dfd1;
    int mask;

    /* Address of the real bind; resolved once, on the first call. */
    static int(*fptr)() = 0;
    char             buffer[50];
    char             *bptr=buffer;

    if (fptr == 0) {
        struct sockaddr_in *ad = (struct
            sockaddr_in*)addr;

        /* Step 1: create the door and attach it to a file system name. */
        if ((did = door_create(server,
            DOOR_COOKIE, DOOR_UNREF)) < 0) {
            perror("door_create");
            return -1;
        }

        unlink(NAME_SERVICE_DOOR);
        mask = umask(0);
        dfd = open(NAME_SERVICE_DOOR,
            O_RDONLY|O_CREAT|O_EXCL|O_TRUNC,
            0644);
        umask(mask);
        if (fattach(did, NAME_SERVICE_DOOR)
            < 0) {
            perror("fattach");
            return -1;
        }

        /* Step 2: look up the next bind in the search order. */
        fptr = (int (*)())dlsym(RTLD_NEXT,
            "bind");
        if (fptr == NULL) {
            (void) printf("dlsym: %s\n",
                dlerror());
            return -1;  /* the real bind cannot be reached */
        }
    }

    /* Step 3: chain to the real bind. */
    return ((*fptr)(s, addr, addrlen));
}

Step 1. The bind function is used to establish a door service.
Step 2. A dlsym lookup is performed to obtain the actual address of the bind function in libsocket.so. This is stored in the static variable fptr and is used for chaining to the actual bind.
Step 3. The libsocket.so bind is called to establish the name.

b. accept

The accept function is used to accept an incoming client connection request. On a connect request, the Speed accept is called because it interposes the libsocket.so accept. The Speed library first calls the libsocket.so accept to actually create a TCP connection. Once a connection has been established, the socket descriptor is stored in a Speed data structure, and the synchronization variables are initialized.

c. read

The read on the server side is a consumer of the client-write data. When the server tries to read data on a file descriptor, the Speed library read function is called because it is interposed. A check is made to see whether the file descriptor matches the established file descriptor and, if so, the read function waits on the semaphore occupied_r. For all other file descriptors, the Speed library function transfers control to the libc.so read.

When the client writes some data, the data is transferred using doors IPC into the server process. The door service in the server process identifies whether the operation is a read or a write. For a client write, it executes a sema_wait on empty_r to make sure the Speed buffer is free, copies the data into that buffer using bcopy, and executes a sema_post on the occupied_r semaphore. The sema_post wakes up the read thread. The read thread copies the data into the application buffer using bcopy and signals the door service thread using sema_post on the empty_r semaphore.

Code Example 2. Server read function in Implementation I
void door_service(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_descriptors)
{
...
    } else if (ptr->type == WRITE) {
        /* Step 1: find the connection and wait for buffer space. */
        fd = ports[ptr->port];
        SEMA_WAIT(&pmap[fd].empty_r);

        /* Step 2: copy the data in and wake the server read. */
        bcopy(ptr->buf, pmap[fd].rbuf, ptr->size);
        sema_post(&pmap[fd].occupied_r);
    }
...
}
Step 1
a. Make sure that the operation is a client write.
b. Copy the port you are communicating on. The client sends it through the door call.
c. Execute a sema_wait to see if there is space in the Speed data buffer (producer).
Step 2
a. bcopy from the door buffer to the Speed data buffer.
b. Execute sema_post to signal to the Speed server side read that data is available.

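ssize_t read(int fildes, void *buf, size_t nbyte)
{
...
    /* Step 1: is this descriptor a Speed connection? */
    if (fildes > 0 && pmap[fildes].fd == fildes) {

        /* Step 2: wait for data, copy it out, free the buffer. */
        SEMA_WAIT(&pmap[fildes].occupied_r);
        bcopy(pmap[fildes].rbuf, buf, nbyte);
        sema_post(&pmap[fildes].empty_r);
        return nbyte;
    }
...
}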
Step 1
Check for a valid file descriptor and see whether it exists in the Speed data structure, i.e., a valid socket descriptor.
Step 2
a. Execute sema_wait to see if data exists in the Speed data buffer to read (consumer).
b. If successful, use bcopy to copy the data from the Speed data structure to the application buffer.
c. Execute sema_post to signal producer (client) that data has been read.

d. write

The write function on the server side is a producer of the client-read data. When the server tries to write data on a file descriptor, the Speed function write is called since it is interposed. A check is made to see whether the file descriptor matches the established connection file descriptor, and, if so, the write function waits on the semaphore empty_w. If successful, the data is copied to the Speed buffer and sema_post is executed on occupied_w.

When the client tries to read data, a fast context switch is done into the server process using doors IPC. The doors service in the server process identifies if the operation is a read or write, and a sema_wait operation is executed on the occupied_w semaphore. The sema_post on occupied_w by the write thread wakes up the door_service thread, and the data in the Speed buffer is transferred to the client-read buffer. A sema_post is executed on the empty_w semaphore.

Code Example 3. Server write function in Implementation I
ssize_t write(int fildes, const void *buf, size_t nbyte)
{
...
    if (fildes > 0 && pmap[fildes].fd == fildes) {
        /* Step 1: wait for space in the Speed data buffer. */
        SEMA_WAIT(&pmap[fildes].empty_w);

        /* Step 2: copy the data in and signal door_service. */
        bcopy(buf, pmap[fildes].wbuf, nbyte);
        sema_post(&pmap[fildes].occupied_w);
        return nbyte;
    }
...
}
Step 1
a. Execute a sema_wait to see whether there is space in the Speed data buffer (producer).
Step 2
a. bcopy from the application buffer to the Speed data buffer.
b. Execute sema_post to signal door_service that data is available.

void door_service(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_descriptors)
{
...
    if (ptr->type == READ) {
        /* Step 1: find the connection and wait for server data. */
        fd = ports[ptr->port];
        SEMA_WAIT(&pmap[fd].occupied_w);

        /* Step 2: copy the data into the door buffer and free the Speed buffer. */
        bcopy(pmap[fd].wbuf, ptr->buf, ptr->size);
        sema_post(&pmap[fd].empty_w);
        door_return((char*)ptr->buf, ptr->size, NULL, 0);
    }
...
}
Step 1
a. Client should have executed a read call asking for data; door_call executed for a fast context switch to server address space.
b. Execute a sema_wait to see whether there is data in the Speed data buffer.
Step 2
a. If successful, bcopy the data from the Speed data structure to the door buffer.
b. Execute sema_post to signal producer (server) that data has been read.


Client Side Functions

a. connect

The client establishes a connection to the server using the connect function. Since the connect symbol is interposed, the Speed version of connect gets control. A connection is established to the server using the doors IPC. The libsocket.so connect is called to establish a real connection. If the connection is successful, the necessary Speed data structures are created.

Code Example 4. Client connect Function in Implementation I
int connect(int s, const struct sockaddr *addr, socklen_t addrlen)
{
...
    if (fptr == 0) {
        /* Open the server door and query it with door_info. */
        if ((door_fd = open(NAME_SERVICE_DOOR, O_RDONLY)) < 0) {
            perror("Open bogus");
            exit(1);
        }
        info.di_target = 0;
        if (door_info(door_fd, &info) < 0) {
            perror("Door_info");
            printf("errno=%d\n", errno);
            exit(1);
        }

        /* Look up the next connect in the search order. */
        fptr = (int (*)())dlsym(RTLD_NEXT, "connect");
        if (fptr == NULL) {
            (void) printf("dlsym: %s\n", dlerror());
            return -1;  /* the real connect cannot be reached */
        }
    }

    /* Chain to the real connect. */
    dinfo[s].fd = s;
    ret = ((*fptr)(s, addr, addrlen));

    /* On success, record the local port of this connection. */
    if (ret != -1) {
        slen = sizeof(client);
        getsockname(s, (struct sockaddr *)&client, &slen);
        dinfo[s].port = client.sin_port;
    }
    return ret;
}

b. read

When the client calls read to get data from the server, the Speed version of read is called. A check is made to ensure that the file descriptor matches the established connection, and a fast context switch is made into the server door_service. On return, the server data is copied into the client buffer using bcopy.

Code Example 5. Client read Function in Implementation I
ssize_t read(int fildes, void *buf, size_t nbyte)
{
...
    if (dinfo[fildes].fd > 0 && dinfo[fildes].fd == fildes) {
        /* Build the door request that describes this read. */
        dinfo[fildes].read.fd = dinfo[fildes].fd;
        dinfo[fildes].read.port = dinfo[fildes].port;
        dinfo[fildes].read.buf[0] = '\0';
        dinfo[fildes].read.size = nbyte;
        dinfo[fildes].read.type = READ;
        dinfo[fildes].darg_r.data_ptr = (char*)&dinfo[fildes].read;
        dinfo[fildes].darg_r.data_size = PADSIZE + nbyte + 1;
        dinfo[fildes].darg_r.desc_ptr = NULL;
        dinfo[fildes].darg_r.desc_num = 0;
        dinfo[fildes].darg_r.rbuf = (char*)dinfo[fildes].read.buf;
        dinfo[fildes].darg_r.rsize = nbyte;
        door_call(door_fd, &dinfo[fildes].darg_r);

        /* The server data comes back in the door result buffer. */
        bcopy(dinfo[fildes].read.buf, buf, nbyte);

        return nbyte;
    }
...
}

c. write

When the client calls write to send data to the server, the Speed write is called. A check is made to ensure that the file descriptor matches the established connection. If so, the write data is bcopied to a Speed buffer and a fast context switch is made into the server door_service to wake up the waiting server read thread.

Code Example 6. Client write Function in Implementation I
ssize_t write(int fildes, const void *buf, size_t nbyte)
{
...
    if (dinfo[fildes].fd > 0 && dinfo[fildes].fd == fildes) {
        /* Copy the data into the door request, then call the server. */
        bcopy(buf, dinfo[fildes].write.buf, nbyte);
        dinfo[fildes].write.fd = dinfo[fildes].fd;
        dinfo[fildes].write.port = dinfo[fildes].port;
        dinfo[fildes].write.size = nbyte;
        dinfo[fildes].write.type = WRITE;
        dinfo[fildes].darg_w.data_ptr = (char *)&dinfo[fildes].write;
        dinfo[fildes].darg_w.data_size = PADSIZE + nbyte + 1;
        dinfo[fildes].darg_w.desc_ptr = NULL;
        dinfo[fildes].darg_w.desc_num = 0;
        dinfo[fildes].darg_w.rbuf = (char*)dinfo[fildes].write.buf;
        dinfo[fildes].darg_w.rsize = nbyte;
        door_call(door_fd, &dinfo[fildes].darg_w);

        return nbyte;
    }
...
}


Speed Implementation with Solaris Doors and Memory Map - Implementation II

The implementation of the Speed library with doors and memory maps is more complex, as data is copied into a memory mapped (mmap(2)) buffer to avoid making multiple copies of the data. For this implementation, a sliding window type of buffer management has been adopted. For every connection, the server creates a shared memory mapped segment. This segment is divided into multiple windows. Each window is further divided into slots and the number of slots and the slot sizes are configurable.

The libsocket.so accept is no longer called for loopback connections, but it is simulated. However, libsocket.so accept is called for connections coming across the network. The connections are automatically pooled. This was done to re-use the memory map segments instead of creating them for every connection. The server caches the connection and, if a client re-connects, a connection is returned from the pool.
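
The pooling idea can be sketched as a small table of cached connections. Everything in this sketch is an assumption made for illustration (struct conn, struct pool, pool_add, pool_reuse, pool_release, and the capacity of 8); the article does not show the Speed library's own pool structures.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MAX 8              /* assumed pool capacity */

/* Hypothetical pool entry: one cached connection and its mapping. */
struct conn {
    int   port;                 /* simulated port that identifies the connection */
    void *segment;              /* memory mapped segment kept for reuse */
    int   in_use;
};

struct pool {
    struct conn slots[POOL_MAX];
    size_t      count;
};

/* Add a new connection; returns NULL when the pool is full. */
struct conn *pool_add(struct pool *p, int port, void *segment)
{
    struct conn *c;

    if (p->count == POOL_MAX)
        return NULL;
    c = &p->slots[p->count++];
    c->port = port;
    c->segment = segment;
    c->in_use = 1;
    return c;
}

/* Return the idle cached connection for port, or NULL if there is none. */
struct conn *pool_reuse(struct pool *p, int port)
{
    size_t i;

    for (i = 0; i < p->count; i++) {
        if (!p->slots[i].in_use && p->slots[i].port == port) {
            p->slots[i].in_use = 1;
            return &p->slots[i];
        }
    }
    return NULL;
}

/* On close, keep the entry and its segment for the next connect. */
void pool_release(struct conn *c)
{
    c->in_use = 0;
}
```

The point of the design is that a close does not unmap anything: the segment stays attached to the pool entry, so a reconnect costs a table lookup instead of a new mapping.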

Data is now directly bcopied into an available slot in the memory mapped segment. The doors IPC is used only to make a fast context switch into the server process. This makes doors extremely lightweight, resulting in very fast context switch times. The data consumption is still based on the producer/consumer model. The producer now has more slots to copy the data, as the memory mapped segment is divided into windows and slots.
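
How a write is spread over the windows can be sketched as follows. The layout is a simplification assumed for this example (struct ring, four windows of eight bytes, memcpy instead of bcopy), and where the library blocks on a semaphore until a window is free, the sketch simply returns a short count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOWINS 4                /* assumed number of windows */
#define WINSZ  8                /* assumed window size in bytes */

/* Hypothetical layout: per-window fill counts followed by the data area. */
struct ring {
    size_t csz[NOWINS];         /* bytes currently stored in each window */
    size_t active;              /* next window the writer will fill */
    char   data[NOWINS * WINSZ];
};

/*
 * Copy up to n bytes into consecutive windows, one chunk per window,
 * stopping when the next window still holds unread data.
 * Returns the number of bytes stored.
 */
size_t ring_write(struct ring *r, const void *buf, size_t n)
{
    const char *src = buf;
    size_t done = 0;

    while (done < n && r->csz[r->active] == 0) {
        size_t chunk = n - done;

        if (chunk > WINSZ)
            chunk = WINSZ;
        memcpy(r->data + r->active * WINSZ, src + done, chunk);
        r->csz[r->active] = chunk;
        r->active = (r->active + 1) % NOWINS;   /* slide to the next window */
        done += chunk;
    }
    return done;
}
```

With these sizes, a 10-byte write fills window 0 completely and puts the remaining 2 bytes in window 1, leaving windows 2 and 3 free for the next write.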

Server Side Functions

a. bind

The bind function operates as in implementation I, creating a new door service. It initializes buffer management variables and calls the libsocket.so bind to bind the name.

Code Example 7. Server bind Function in Implementation II
int bind(int s, const struct sockaddr *addr, socklen_t addrlen)
{
...
    if (fptr == 0) {
        /* Create the door and attach it to a per-port file name. */
        cptr = (struct sockaddr_in *)addr;
        if ((did = door_create(server, DOOR_COOKIE, DOOR_UNREF)) < 0) {
            perror("door_create");
            return -1;
        }
        sprintf(bptr, "%s%d", NAME_SERVICE_DOOR, cptr->sin_port);
        unlink(bptr);
        mask = umask(0);
        dfd = open(bptr, O_RDONLY|O_CREAT|O_EXCL|O_TRUNC, 0644);
        umask(mask);
        if (fattach(did, bptr) < 0) {
            perror("fattach");
            return -1;
        }

        /* Initialize connection bookkeeping and window parameters. */
        accept_block = FALSE;
        if (getenv("SPEED_ACCEPT_BLOCK") != 0)
            accept_block = TRUE;
        mutex_init(&connect_m, USYNC_THREAD, NULL);
        mutex_init(&used_doors.access, USYNC_THREAD, NULL);
        used_doors.front = MAX_FDS;
        used_doors.number = 0;
        mutex_init(&open_doors.access, USYNC_THREAD, NULL);
        open_doors.index = 0;
        open_doors.open = 0;
        /* BUFSIZE = 8192, 8192 / 2 for r, and w, /winsz for number
           of wins */
        bptr = (char*)getenv("SPEED_NOWINS");
        if (bptr == NULL)
            tparams.nowins = NOWINS;
        else
            tparams.nowins = atoi(bptr);
        if (tparams.nowins <= 0)
            tparams.nowins = NOWINS;

        if ((bptr = (char*)getenv("SPEED_WINSIZE")) == (char*)NULL)
            tparams.winsz = BUFSIZE/4;
        else {
            tparams.winsz = atoi(bptr);
        }
        if (tparams.winsz <= 0)
            tparams.winsz = BUFSIZE/4;
        tparams.bufsize = tparams.winsz * tparams.nowins * FULL_DUPLEX;

        tparams.duplex = FULL_DUPLEX;
        pagesize = getpagesize();
        if (pagesize < BUFSIZE)
            pagesize = BUFSIZE;
        tparams.pagesize = pagesize;
        if (tparams.pagesize < (WINDOW_ATTR_SZ * 3 * tparams.nowins))
            tparams.pagesize = (WINDOW_ATTR_SZ * 3 * tparams.nowins);
        tparams.pagesize += WINDOW_MGMT_SZ;
        tparams.pagesize += (pagesize - (tparams.pagesize % pagesize));
        tparams.mmap_sz = (tparams.winsz * tparams.nowins *
            (tparams.duplex+1)) + tparams.pagesize;

        /* Look up the next bind in the search order. */
        fptr = (int (*)())dlsym(RTLD_NEXT, "bind");
        if (fptr == NULL) {
            DEBUG(fprintf(stderr, "dlsym: %s\n", dlerror()));
            return -1;  /* the real bind cannot be reached */
        }
        sema_init(&accept_p_s, 1, USYNC_THREAD, 0);
        sema_init(&accept_r_s, 0, USYNC_THREAD, 0);

        closed_door_q.max_elems = MAX_FDS;
        closed_door_q.first_elem = 0;
        closed_door_q.last_elem = 0;
        closed_door_q.no_elems = 0;
    }

    /* Chain to the real bind. */
    return ((*fptr)(s, addr, addrlen));

}

b. accept

The accept function is now simulated for loopback connections. A producer/consumer paradigm is again employed. The door_service function is the producer of the connections, and the accept function is the consumer of these connections. The door_service function produces connections on requests from loopback clients.

The accept function now waits on the semaphore accept_r_s for a client connection. When a client tries to establish a loopback connection, a fast context switch is made using doors IPC into the door_service on the server. The memory mapped structures are created if it is a new connection, and a sema_post is executed on accept_r_s by the door_service thread. This wakes up the accept thread, and a successful connection is created. The TCP ephemeral port is also simulated.

Code Example 8. Server door_service Function in Implementation II
void door_service(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_descriptors)
{

...
    } else if (ptr->type == CONNECT) {
        /* Step 1 */
        client_doorinfo *ptr = (client_doorinfo*)argp;
        size = ptr->size;
        mutex_lock(&connect_m); /* At the moment connect requests */
                                /* are serialized, slowing down this segment */
        while (sema_wait(&accept_p_s));

        /* Step 2 */
        connect_port = -1;
        if (ptr->port > 0)
            connect_port = ptr->port;
        accept_fd = socket(AF_INET, SOCK_STREAM, 0);
        if (connect_port == -1) {
            connect_port = port_avail;
            port_avail++;
            port_avail %= szshort;
        }

        /* Step 3 */
        accept_fd = door_accept(accept_fd, &client,
            sizeof(client), 1);

        if (accept_fd == -1) {
            ptr->port = -1;
            sema_post(&accept_p_s);
            mutex_unlock(&connect_m);
            door_return((char*)ptr, size, NULL, 0);
        }

        /* Step 4 */
        ptr->port = client.sin_port;
        pmap[fd].state = INUSE;
        doconnect(accept_fd, (client_doorinfo*)ptr);
        accept_count++;
        sema_post(&accept_r_s);
        mutex_unlock(&connect_m);
        door_return((char*)ptr, size, NULL, 0);
    }
...
}

Step 1. Connections at the moment are serialized. sema_wait to see if accept is free to create a connection.
Step 2. connect_port will be -1 if it is a new connection and will have a value if it is pooled.
Step 3. Create memory mapped segment and data structures needed for the connection.
Step 4. If the connection is successful, return the connection information to the client.

int accept(int s, struct sockaddr *addr, Psocklen_t addrlen)
{
...
    /* Step 1 */
    for (;;) {
        if (sema_wait(&accept_r_s)) {
            for (j=0; j<100; j++);
        } else
            break;
    }

    /* Step 2 */
    accept_count--;
    client = (struct sockaddr_in *)addr;
    client->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    client->sin_family = AF_INET;
    client->sin_port = htons(connect_port);
    fildes = accept_fd;
    sema_post(&accept_p_s);
    return fildes;
...
}
Step 1. Connections at the moment are serialized. sema_wait for client connection request; see above door_service.
Step 2. Simulate the TCP connection data and execute a sema_post to signal door_service of a successful connection.

c. read

The server read functions as in Code Example 2 and is a consumer of the client-write data. Since a sliding window type of protocol is used, some calculation is required to find the correct window and the correct slot in the window. The read function waits on the rd_occupied semaphore. When the client writes data, it is copied into a memory mapped slot, and a fast context switch is made into the door_service on the server. The door service does a sema_wait on the rd_empty semaphore and, if successful, executes a sema_post operation on the rd_occupied semaphore. The sema_post wakes up the read thread, and the read thread copies the data using bcopy and executes a sema_post on the rd_empty semaphore.
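
The partial-read bookkeeping can be sketched on a single window. The names here are hypothetical (struct window, window_read, and the fields csz and start, named after the CSZ and START_ADDR attributes), and the sketch accumulates the start offset so that any number of partial reads of one window stay aligned; that is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-window bookkeeping, named after CSZ and START_ADDR. */
struct window {
    size_t csz;                 /* unread bytes left in the window */
    size_t start;               /* offset of the first unread byte */
    char   data[16];
};

/*
 * Copy at most n unread bytes out of the window.
 * Sets *drained to 1 when the window is empty again and may be reused.
 */
size_t window_read(struct window *w, void *buf, size_t n, int *drained)
{
    if (n > w->csz)
        n = w->csz;             /* short read: return what is there */
    memcpy(buf, w->data + w->start, n);
    w->start += n;
    w->csz -= n;

    *drained = (w->csz == 0);
    if (*drained)
        w->start = 0;           /* reset for the next producer */
    return n;
}
```

When drained comes back as 0, the caller would set its partial-read flag and skip the semaphore wait on the next read; when it comes back as 1, the caller advances to the next window and posts the empty semaphore.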

Code Example 9. Server read Function in Implementation II
ssize_t read(int fd, void *buf, size_t nbyte)
{
...
    if (fd > 0 && pmap[fd].fd == fd) {
        /* Wait for data unless a window is only partly consumed. */
        w_mgmt_ptr = pmap[fd].r_w_mgmt_ptr;
        if (pmap[fd].partial_read_flag == 0) {
            rd_occupied--;
            while (sema_wait(&pmap[fd].rd_occupied));
        }

        /* Locate the active window and the first unread byte in it. */
        win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
        w_attr_ptr = (int*)(pmap[fd].r_w_attr_ptr_offset +
            WINDOW_INDEX(win));
        mptr = pmap[fd].r_mptr;
        w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
        w_dptr = w_dptr + w_attr_ptr[START_ADDR];

        /* Copy out at most the CSZ bytes recorded for the window. */
        if (nbyte <= w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, nbyte);

            w_attr_ptr[START_ADDR] = nbyte;
            w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte;
        } else if (nbyte > w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
            nbyte = w_attr_ptr[CSZ];
            w_attr_ptr[CSZ] = 0;
        }

        /* Window drained: advance to the next window and free this one. */
        if (w_attr_ptr[CSZ] == 0) {
            w_attr_ptr[START_ADDR] = 0;
            w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
            w_mgmt_ptr[SERVER_ACTIVE_WIN] =
                w_mgmt_ptr[SERVER_ACTIVE_WIN]
                % tparams.nowins;
            rd_empty++;
            pmap[fd].partial_read_flag = 0;

            sema_post(&pmap[fd].rd_empty);
        } else {
            pmap[fd].partial_read_flag = 1;
        }
...
}

void door_service(void *cookie, char *argp,
    size_t arg_size, door_desc_t *dp, uint_t n_descriptors)
{
...
    } else if (ptr->type == WRITE) {
...
        /* Wait for a free window, advance the client window, wake the read. */
        while (sema_wait(&pmap[fd].rd_empty));
        mptr = (int*)pmap[fd].mdoor.mptr;
        w_mgmt_ptr = mptr + WINDOW_MGMT_BEGIN;
        w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
        w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
            w_mgmt_ptr[CLIENT_ACTIVE_WIN] % tparams.nowins;
        sema_post(&pmap[fd].rd_occupied);
...
}

d. write

The write also functions as in implementation I and is a producer of client-read data. The write function waits on a wr_empty semaphore and, if successful, bcopies data into a memory mapped slot. It executes a sema_post on the wr_occupied semaphore to wake up the door_service thread. When the client tries to read some data, a fast context switch is made into the door_service on the server, and a sema_wait is executed on the wr_occupied semaphore. If successful, a sema_post is executed on the wr_empty semaphore.

Code Example 10. Server write Function in Implementation II
ssize_t write(int fd, const void *buf, size_t nbyte)
{
...
    if (fd > 0 && pmap[fd].fd == fd) {
        cbuf = (void*)buf;
        csz = nbyte;
        w_mgmt_ptr = pmap[fd].w_w_mgmt_ptr;

        mptr = pmap[fd].w_mptr;
        /* Fill one window per iteration until all the data is copied. */
        while (csz > 0) {
            wr_empty--;
            sema_ptr = (sema_t*)&pmap[fd].wr_empty;
            while (sema_wait(&pmap[fd].wr_empty));

            win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
            w_attr_ptr = (int*)(pmap[fd].w_w_attr_ptr_offset +
                WINDOW_INDEX(win));
            w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];

            if (csz <= w_attr_ptr[SZ]) {
                bcopy(cbuf, w_dptr, csz);
                w_attr_ptr[CSZ] = csz;
                cbuf = ((char*)cbuf) + csz;
                csz = 0;
            } else if (csz > w_attr_ptr[SZ]) {
                bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
                w_attr_ptr[CSZ] = w_attr_ptr[SZ];
                csz = csz - w_attr_ptr[SZ];
                cbuf = ((char*)cbuf) + w_attr_ptr[SZ];
            }

            w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
            w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
                w_mgmt_ptr[CLIENT_ACTIVE_WIN] %
                tparams.nowins;
            wr_occupied++;
            sema_ptr = (sema_t*)&pmap[fd].wr_occupied;
            sema_post(&pmap[fd].wr_occupied);
        }
...
}

void door_service(void *cookie, char *argp, size_t arg_size,
    door_desc_t *dp, uint_t n_descriptors)
{
    if (ptr->type == READ) {
        while (sema_wait(&pmap[fd].wr_occupied));
...
        sema_post(&pmap[fd].wr_empty);
        door_return((char*)&ptr->ret, sizeof(int), NULL, 0);
    }
...
}

Client Side Functions

a. connect

The connect function does a fast context switch to the door_service to set up a connection with the server. See the example of the server side accept function in Code Example 8. On the return from the door service, the shared memory mapped segment is mapped into the client address space. The connect function caches client connections, and if the client reconnects, it sends the cached descriptor to the server to reestablish the connection.

b. read

The read function is similar to the server read function, except that it runs on the client side. The read function does a fast context switch to the door_service on the server and waits for server-write data. On return from the door call, the data is bcopied to the client buffer from the memory mapped slot.

Code Example 11. Client read Function in Implementation II
ssize_t read(int fildes, void *buf, size_t nbyte)
{
...
    if (fildes > 0 && dinfo[fildes].fd == fildes) {
        if (dinfo[fildes].partial_read_flag == 0) {
        ...
            /* Ask the server, through the door, to wait for data. */
            dinfo[fildes].rinfo.size = nbyte;
            dinfo[fildes].rinfo.type = READ;
            dinfo[fildes].rinfo.port = dinfo[fildes].port;
            darg.data_ptr = (char *)&dinfo[fildes].rinfo;
            darg.data_size = sizeof(readinfo);
            darg.desc_ptr = NULL;
            darg.desc_num = 0;
            darg.rbuf = (char*)&dinfo[fildes].rinfo.ret;
            darg.rsize = sizeof(int);
            /* semaphore block on occupied will happen
               in the door server */
            door_call(door_fd, &darg);
            if (dinfo[fildes].rinfo.ret == -1) {
                dinfo[fildes].state = CLOSE;
                return 0;
            }
            if (dinfo[fildes].rinfo.ret > 0) {
                dinfo[fildes].rinfo.nowins =
                    dinfo[fildes].rinfo.ret;
                dinfo[fildes].rinfo.nowins--;
                dinfo[fildes].state = IN_CLOSE;
            }
            }
        }

        /* Locate the active window and the first unread byte in it. */
        mptr = dinfo[fildes].r_mptr;
        win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
        w_attr_ptr = (int*)(dinfo[fildes].r_w_attr_ptr_offset +
            WINDOW_INDEX(win));
        w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
        w_dptr = w_dptr + w_attr_ptr[START_ADDR];

        if (nbyte <= w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, nbyte);
            w_attr_ptr[START_ADDR] = nbyte;
            w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte;
        } else if (nbyte > w_attr_ptr[CSZ]) {
            bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
            nbyte = w_attr_ptr[CSZ];
            w_attr_ptr[CSZ] = 0;
        }

        /* Window drained: advance to the next window. */
        if (w_attr_ptr[CSZ] == 0) {
            w_attr_ptr[START_ADDR] = 0;
            w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
            w_mgmt_ptr[SERVER_ACTIVE_WIN] =
                w_mgmt_ptr[SERVER_ACTIVE_WIN]
                % tparams.nowins;
            dinfo[fildes].partial_read_flag = 0;
        } else {
            dinfo[fildes].partial_read_flag = 1;
        }

        return nbyte;
    }
}

c. write

The write is similar to the server-side write function. Client data is bcopied to a memory mapped slot, a fast context switch is executed to enter the door_service function on the server, and the waiting server read thread is woken up.

Code Example 12. Client write Function in Implementation II
ssize_t write(int fildes, const void *buf, size_t nbyte)
{
    if (fildes > 0 && dinfo[fildes].fd == fildes) {
        cbuf = (void*)buf;
        csz = nbyte;
        w_mgmt_ptr = dinfo[fildes].w_w_mgmt_ptr;
        mptr = dinfo[fildes].w_mptr;

        /* Fill one window per iteration, then notify the server. */
        while (csz > 0) {
            win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
            w_attr_ptr = (int*)(dinfo[fildes].w_w_attr_ptr_offset +
                WINDOW_INDEX(win));
            w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
            if (csz <= w_attr_ptr[SZ]) {
                bcopy(cbuf, w_dptr, csz);
                w_attr_ptr[CSZ] = csz;
                cbuf = ((char*)cbuf) + csz;
                csz = 0;
            } else if (csz > w_attr_ptr[SZ]) {
                bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
                w_attr_ptr[CSZ] = w_attr_ptr[SZ];
                csz = csz - w_attr_ptr[SZ];
                cbuf = ((char*)cbuf) + w_attr_ptr[SZ];
            }

            dinfo[fildes].winfo.size = nbyte;
            dinfo[fildes].winfo.type = WRITE;
            dinfo[fildes].winfo.port = dinfo[fildes].port;
            darg.data_ptr = (char *)&dinfo[fildes].winfo;
            darg.data_size = sizeof(writeinfo);
            darg.desc_ptr = NULL;
            darg.desc_num = 0;
            darg.rbuf = NULL;
            darg.rsize = 0;
            door_call(door_fd, &darg);
        }

        return nbyte;
    }
...
}


Configuration Environment Variables

As mentioned earlier, the memory mapped segment is divided into windows, and each window is divided into slots. The number of windows is not configurable at this time, but the number and size of slots are configurable through environment variables. Currently, the number of slots is limited to 152. This could be increased for better performance.

Environment Variable    Value Range
SPEED_NOWINS            2-152
SPEED_WINSIZE           256-8192
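
Reading these settings can be sketched as below. The helper env_setting is hypothetical, and rejecting out-of-range values in favor of a default is an assumption of this sketch; the bind function in Code Example 7 only falls back to its defaults when the parsed value is zero or negative.

```c
#define _POSIX_C_SOURCE 200809L /* setenv/unsetenv for the usage example */
#include <assert.h>
#include <stdlib.h>

/* Ranges taken from the table of environment variables. */
#define NOWINS_MIN   2
#define NOWINS_MAX   152
#define WINSIZE_MIN  256
#define WINSIZE_MAX  8192

/*
 * Read an integer setting from the environment.
 * Falls back to def when the variable is unset, is not a number,
 * or lies outside [lo, hi].
 */
long env_setting(const char *name, long lo, long hi, long def)
{
    const char *s = getenv(name);
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return def;
    v = strtol(s, &end, 10);
    if (*end != '\0' || v < lo || v > hi)
        return def;
    return v;
}
```

A server would typically be started with the variables set on the command line next to LD_PRELOAD, so that the preloaded library sees them in its bind function.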


Speed Implementation with Memory Map Only - Implementation III

This implementation is similar to implementation II as discussed above, but doors IPC is not used for context switching. Instead, system-scope semaphores are created in the shared mmapped space, and a sema_post is executed on these semaphores to signal data availability. The bind, accept and connect functions have similar functionality.
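
Semaphores that live inside the shared mapping can be sketched with portable calls. This is an illustration under assumptions, not the library code: it uses POSIX sem_init with the process-shared flag instead of the Solaris system-scope semaphores, an anonymous shared mapping inherited across fork instead of a mapping shared by two unrelated processes, and the hypothetical names struct shared_slot, wait_sem, and shared_roundtrip.

```c
#define _DEFAULT_SOURCE         /* MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical shared segment: the semaphores live next to the data. */
struct shared_slot {
    sem_t occupied;             /* posted by the writer when data is ready */
    sem_t empty;                /* posted by the reader when data is consumed */
    char  buf[32];
};

static void wait_sem(sem_t *sem)
{
    while (sem_wait(sem) != 0 && errno == EINTR)
        ;                       /* retry if interrupted by a signal */
}

/*
 * Fork a writer process that stores msg in shared memory, then read it
 * back in the parent. Returns 0 on success, -1 on error.
 */
int shared_roundtrip(const char *msg, char *out, size_t outsz)
{
    struct shared_slot *s;
    pid_t pid;
    int rc = -1;

    if (outsz == 0)
        return -1;
    s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return -1;
    /* Second argument 1: the semaphore is shared between processes. */
    if (sem_init(&s->occupied, 1, 0) != 0 || sem_init(&s->empty, 1, 1) != 0)
        goto out;

    pid = fork();
    if (pid < 0)
        goto out;
    if (pid == 0) {             /* child: the writer */
        wait_sem(&s->empty);
        strncpy(s->buf, msg, sizeof s->buf - 1);
        s->buf[sizeof s->buf - 1] = '\0';
        sem_post(&s->occupied);
        _exit(0);
    }

    wait_sem(&s->occupied);     /* parent: the reader */
    strncpy(out, s->buf, outsz - 1);
    out[outsz - 1] = '\0';
    sem_post(&s->empty);
    waitpid(pid, NULL, 0);
    rc = 0;
out:
    munmap(s, sizeof *s);
    return rc;
}
```

No door call is involved in the handoff: the only kernel work is the semaphore wait and post, and the data itself moves with a plain user-level copy.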

Server Side Functions

a. read

The read function is similar to the read function in Code Example 9. The server read thread waits on a system-scope w_mgmt_ptr[SEMA_R_O] semaphore. The client writes data directly into the mmapped slot, and executes a sema_post on w_mgmt_ptr[SEMA_R_O] semaphore to wake up the server read thread. The read thread executes a bcopy to transfer the data from the memory mapped slot into the server read buffer.


Code Example 13. Server read Function in Implementation III
ssize_t read(int fd, void *buf, size_t nbyte)
{
...
	if (fd > 0 && pmap[fd].fd == fd) {
		w_mgmt_ptr = pmap[fd].r_w_mgmt_ptr;
		if (pmap[fd].partial_read_flag == 0) {
			/* Wait until the client signals that data is available. */
			sema_ptr = (sema_t*)&w_mgmt_ptr[SEMA_R_O];
			while(sema_wait(sema_ptr));
		}

		win = w_mgmt_ptr[SERVER_ACTIVE_WIN];
		w_attr_ptr = (int*)(pmap[fd].r_w_attr_ptr_offset +
					WINDOW_INDEX(win));
		mptr = pmap[fd].r_mptr;
		w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
		w_dptr = w_dptr + w_attr_ptr[START_ADDR];

		if (nbyte <= w_attr_ptr[CSZ]) {
			bcopy(w_dptr, buf, nbyte);
			/* Advance past the bytes just consumed. */
			w_attr_ptr[START_ADDR] += nbyte;
			w_attr_ptr[CSZ] = w_attr_ptr[CSZ] - nbyte;
		} else if (nbyte > w_attr_ptr[CSZ]) {
			bcopy(w_dptr, buf, w_attr_ptr[CSZ]);
			nbyte = w_attr_ptr[CSZ];
			w_attr_ptr[CSZ] = 0;
		}
		if (w_attr_ptr[CSZ] == 0) {
			/* Slot fully consumed: advance to the next window
			   and post SEMA_R_E. */
			w_attr_ptr[START_ADDR] = 0;
			w_mgmt_ptr[SERVER_ACTIVE_WIN]++;
			w_mgmt_ptr[SERVER_ACTIVE_WIN] =
				w_mgmt_ptr[SERVER_ACTIVE_WIN] % tparams.nowins;
			rd_empty++;
			pmap[fd].partial_read_flag = 0;
			sema_ptr = (sema_t*)&w_mgmt_ptr[SEMA_R_E];
			sema_post(sema_ptr);
		} else {
			pmap[fd].partial_read_flag = 1;
		}
		return nbyte;
	}
	return ((*fptr)(fd, buf, nbyte));
}


b. write

The write function is similar to the write function described in implementation II. The write thread waits on a system-scope w_mgmt_ptr[SEMA_W_E] semaphore. If successful, data is bcopied into the memory mapped slot, and a sema_post is executed on the w_mgmt_ptr[SEMA_W_O] semaphore to wake up the client read thread.

Code Example 14. Server write Function in Implementation III
ssize_t write(int fd, const void *buf, size_t nbyte)
{
...
	if (fd > 0 && pmap[fd].fd == fd) {
		cbuf = (void*)buf;
		csz = nbyte;
		w_mgmt_ptr = pmap[fd].w_w_mgmt_ptr;

		mptr = pmap[fd].w_mptr;
		while (csz > 0) {
			/* Wait on SEMA_W_E before filling the next slot. */
			sema_ptr = (sema_t*)&w_mgmt_ptr[SEMA_W_E];
			while(sema_wait(sema_ptr));

			win = w_mgmt_ptr[CLIENT_ACTIVE_WIN];
			w_attr_ptr = (int*) (pmap[fd].w_w_attr_ptr_offset +
						WINDOW_INDEX(win));
			w_dptr = mptr + w_attr_ptr[DBUF_OFFSET];
			if (csz <= w_attr_ptr[SZ]) {
				bcopy(cbuf, w_dptr, csz);
				w_attr_ptr[CSZ] = csz;
				cbuf = ((char*)cbuf) + csz;
				csz = 0;
			} else if (csz > w_attr_ptr[SZ]) {
				bcopy(cbuf, w_dptr, w_attr_ptr[SZ]);
				w_attr_ptr[CSZ] = w_attr_ptr[SZ];
				csz = csz - w_attr_ptr[SZ];
				cbuf = ((char*)cbuf) + w_attr_ptr[SZ];
			}
			/* Advance to the next window, wrapping around. */
			w_mgmt_ptr[CLIENT_ACTIVE_WIN]++;
			w_mgmt_ptr[CLIENT_ACTIVE_WIN] =
				w_mgmt_ptr[CLIENT_ACTIVE_WIN] % tparams.nowins;

			/* Signal that data is available. */
			sema_ptr = (sema_t*)&w_mgmt_ptr[SEMA_W_O];
			sema_post(sema_ptr);
		}
...
}


Client Side

The read and write functions on the client side are similar to the read and write functions on the server side, as shown above.

Configuration Environment Variables

Just as in implementation II, the number and size of slots are configurable through environment variables. Currently, the number of slots is limited to 152. This could be increased for better performance.

Environment Variable   Value Range
SPEED_NOWINS           2-152
SPEED_WINSIZE          256-8192



Performance Comparisons

To measure and compare the performance of the various Speed implementations, several tests were carried out using a publicly available multithreaded client-server program. The programs are from "Multithreaded Programming with Pthreads" [6]. Some simple modifications were made to the client.c and server_ms.c code. The socket file descriptor was configured with fcntl(NDELAY), fcntl(NOBLOCK), and TCP_NODELAY to ensure fast response times. TNF instrumentation was added to get more accurate read and write latencies. The client-server application was first run without the Speed library interposed, in other words, with libsocket.so only, and the measurements were recorded. The application was then run with the different implementations of the Speed library interposed, and the measurements were again recorded. The results are discussed in detail in the following sections.
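
For readers who want to reproduce the socket settings, the following is a minimal sketch of how a TCP socket is typically put into non-blocking mode and has Nagle's algorithm disabled. It assumes that the non-blocking setting mentioned above corresponds to the O_NONBLOCK flag set through fcntl; the function name make_socket_low_latency is made up for this example and does not come from the modified client.c or server_ms.c.

```c
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Put a TCP socket into non-blocking mode and disable Nagle's algorithm.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int make_socket_low_latency(int fd)
{
    int flags;
    int on = 1;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    /* TCP_NODELAY sends small segments immediately instead of coalescing. */
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

The function would be called once on each connected or accepted descriptor before the first read or write.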


Latency Measurements Using TNF Instrumentation

To obtain the latency of the read and write functions, the client and server were executed separately with ktrace and prex. The measurements were first recorded with libsocket.so only, without the Speed library. The Speed library implementations were then interposed and the latencies were measured again. The hardware used was an E450 with four 400 MHz CPUs and two gigabytes of memory, running Solaris 8 (with no updates) in 32-bit mode.

The client and server were first run with all CPUs enabled and with no processor set, and the times were captured for different message sizes. Then a processor set was created with two CPUs, with interrupts disabled in the set. The server was started, and all of its LWPs were bound to this processor set by using psrset -b [set] serverpid. The TNF latency data is tabulated below.

Average Latency to Send and Receive a Message of 70 Bytes

Latency measurements on the server side

Without Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.125677456289998    0.0297473708562897   100001
With Speed Implementation II interposed    0.0110809523404773   0.0203847802021977   100001
With Speed Implementation III interposed   0.0205440758792417   0.0206150824791755   100001

With Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.124444387940001    0.0280241496185034   100001
With Speed Implementation II interposed    0.0526672733072647   0.0147964380856189   100001
With Speed Implementation III interposed   0.0208632908170919   0.02032837898621     100001


Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:

SPEED_WINSIZE=1024
SPEED_NOWINS=152


Latency measurements on the Client Side

Without Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.125677456289998    0.157995061719998    100001
With Speed Implementation II interposed    0.0621857830399993   0.0559983035969648   100001
With Speed Implementation III interposed   0.0587933812261891   0.0535122446399991   100001

With Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.0893453558800021   0.0763784739052616   100001
With Speed Implementation II interposed    0.112057333939998    0.106013310426899    100001
With Speed Implementation III interposed   0.0586836595699997   0.0521766302836968   100001


Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.

Average Latency to Send and Receive a Message of 512 Bytes

Server Side

Without Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.0402189073300003   0.0491265955040463   100001
With Speed Implementation II interposed    0.0194886660533382   0.0133252572474274   100001
With Speed Implementation III interposed   0.0211982528274719   0.0224482145878531   100001

With Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.0396790602099998   0.039741772542272    100001
With Speed Implementation II interposed    0.0661524060259426   0.01320928669713     100001
With Speed Implementation III interposed   0.022598017399827    0.0236643345866558   100001


Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:

SPEED_WINSIZE=1024
SPEED_NOWINS=152

Client Side

Without Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.112412333176       0.10398922812        100001
With Speed Implementation II interposed    0.0671600455099987   0.0595745007849909   100001
With Speed Implementation III interposed   0.0621186681633193   0.0562640110500007   100001

With Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.0941984248899985   0.085334515084848    100001
With Speed Implementation II interposed    0.112832879440001    0.106662469745304    100001
With Speed Implementation III interposed   0.0586836595699997   0.0521766302836968   100001


Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.


Average Latency to Send and Receive a Message of 1000 Bytes

Server Side

Without Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.0397869155800024   0.0670270991090072   100001
With Speed Implementation II interposed    0.0167015795242036   0.012646576994229    100001
With Speed Implementation III interposed   0.0240916335036657   0.024710518264817    100001

With Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.0408155563300022   0.0491813939560609   100001
With Speed Implementation II interposed    0.0480595041849585   0.0201019823701757   100001
With Speed Implementation III interposed   0.0240916335036657   0.024710518264817    100001


Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.
3. The server was started with the following window configuration for measurements with Speed library interposed:

SPEED_WINSIZE=1024
SPEED_NOWINS=152

Client Side

Without Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.110900986299999    0.101823674323258    100001
With Speed Implementation II interposed    0.0627418030919698   0.0681796264200012   100001
With Speed Implementation III interposed   0.0661602952899995   0.0600366353736469   100001

With Processor Set

Configuration                              Read in secs         Write in secs        Number of messages
Client-server only                         0.102592658859999    0.0929193959160412   100001
With Speed Implementation II interposed    0.0681796264200012   0.0627418030919698   100001
With Speed Implementation III interposed   0.0678388086000003   0.0616162311676886   100001


Note:
1. Speed Implementation I had problems with TNF being interposed.
2. "client-server only" i.e., with just libsocket.so, without Speed library interposition.

Performance Measurements

The Speed library was recompiled without the TNF instrumentation, and the performance of the client and server was measured again, first with only libsocket.so (without the Speed library interposed) and then with the different Speed libraries interposed. The timings were recorded with the iobench routine, which is part of the client program. This routine uses the proc system to get very accurate measurements.

The client-server application was run with different CPU configurations and in a processor set to measure the optimum performance. The results were then tabulated. Only the best configurations are shown below.

Table Key (columns I through IV in the tables below):

  1. Client-server only, no Speed library; server run in a processor set of two CPUs, interrupts disabled; client run on the remaining two CPUs
  2. Speed Implementation I interposed, no processor set, four CPUs
  3. Speed Implementation II interposed, no processor set, four CPUs
  4. Speed Implementation III interposed, no processor set, two CPUs

Send and Receive 100,000 Messages of 70 Bytes

Run Information

Time                           I              II             III            IV
Elapsed                        16.1879        6.25777        5.99889        4.89628
Total                          96.893         37.5451        35.9921        14.6787
CPU                            4.6699         5.60625        6.1262         4.57105
User                           1.75525        2.09086        2.55358        2.8791
System                         2.91465        3.51539        3.57261        1.69194
Trap                           0.000489566    0.000106294    9.4778e-05     0.000822242
Wait                           0.225347       0.00119709     0.000498026    4.08094
Process                        -1.61141e+06   -1.61206e+06   -1.61111e+06   -1.64029e+06
Stopped                        1.736e-05      1.324e-05      1.67e-05       2.6115e-05
Voluntary context switches     8              7              13             1
Involuntary context switches   0              0              1              2
CPU usage                      28.9%          89.6%          102.2%         93.4%

Send and Receive 100,000 Messages of 512 Bytes

Run Information

Time                           I              II             III            IV
Elapsed                        9.77146        8.04453        6.8338         5.13338
Total                          58.6191        48.2656        41.0015        15.3894
CPU                            7.49872        7.46839        7.20227        4.83241
User                           1.84276        2.66459        3.66689        3.3498
System                         5.65596        4.8038         3.53539        1.48261
Trap                           0.000857673    0.000159098    0.000193456    0.00167688
Wait                           0.620977       0.00156986     0.000336208    4.5697
Process                        -1.61144e+06   -1.61208e+06   -1.61113e+06   -1.64049e+06
Stopped                        1.839e-05      1.275e-05      1.2425e-05     2.2145e-05
Voluntary context switches     8              7              10             1
Involuntary context switches   0              0              1              2
CPU usage                      76.8%          92.9%          105.4%         94.2%

Send and Receive 100,000 Messages of 1000 Bytes

Run Information

Time                           I              II             III            IV
Elapsed                        10.8131        10.3504        6.56088        5.61761
Total                          64.783         62.1003        39.364         16.8415
CPU                            10.5051        8.53439        7.07174        5.3118
User                           2.5622         2.88587        3.48246        3.68684
System                         7.94289        5.64853        3.58928        1.62496
Trap                           0.000599836    0.000211443    0.000157957    0.00278208
Wait                           1.07922        0.00108808     0.000555176    5.13579
Process                        -1.61145e+06   -1.61209e+06   -1.61114e+06   -1.64038e+06
Stopped                        1.6255e-05     2.166e-05      1.63e-05       2.4335e-05
Voluntary context switches     7              7              10             1
Involuntary context switches   0              0              1              2
CPU usage                      97.2%          82.5%          107.8%         94.6%

bcopy Time

The time to bcopy 1000 bytes 100,000 times from a user buffer to a memory mapped Speed buffer and vice versa, was measured using the real-time function gethrtime. The average time of the read and write operation is shown in the table below. This was measured to estimate the actual time spent copying data as opposed to the system-time component, such as the time to context switch, time for sema operations, and so forth. From this average time, the time to bcopy 512 bytes and 70 bytes was deduced.

Time to bcopy

Iterations   1000 bytes in ms   512 bytes in ms   70 bytes in ms
100000       490                250               34.3
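
The measurement and the scaling step can be approximated with a short program. The sketch below is an assumption-laden stand-in for the original test: it uses memcpy instead of bcopy, clock_gettime(CLOCK_MONOTONIC) instead of gethrtime, and ordinary static buffers instead of the memory mapped Speed buffer. The function names are invented for this example. The second function reproduces the linear scaling used to deduce the 512-byte and 70-byte figures from the 1000-byte measurement (490 ms x 512/1000 is about 250 ms, and 490 ms x 70/1000 is 34.3 ms).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Total seconds spent copying nbytes bytes, iterations times. */
double time_copies(size_t nbytes, long iterations)
{
    static char src[8192], dst[8192];
    volatile char sink;   /* discourages the compiler from dropping the copies */
    struct timespec t0, t1;
    long i;

    assert(nbytes > 0 && nbytes <= sizeof(src) && iterations > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++) {
        src[0] = (char)i;             /* vary the source between copies */
        memcpy(dst, src, nbytes);
        sink = dst[nbytes - 1];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Scale a measured time linearly from one message size to another. */
double scale_by_size(double measured, size_t from_bytes, size_t to_bytes)
{
    assert(from_bytes > 0);
    return measured * (double)to_bytes / (double)from_bytes;
}
```

Calling time_copies(1000, 100000) gives a figure comparable in kind to the 1000-byte column, although the absolute value on current hardware will differ greatly from the E450 numbers.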

Volano 1.0 Performance

Running Volano Mark 1.0 with the Speed library interposed boosts performance by a factor of five. The Speed library does not work with the newer Volano 2.X, as the poll() call is not yet supported by the library.



Observations

  1. Speed Implementation III seems to be the fastest. The best time was obtained when run under a two-CPU configuration. While this configuration turned out to be the best for Speed Implementation III, the others performed badly with this configuration, as they are more heavyweight and need more system resources to perform optimally.
  2. Speed Implementation I has considerable overhead, as it uses the kernel to dynamically allocate memory to copy the client and server data and transfer it from one address space to the other.
  3. Speed Implementation II uses mmap to copy the data to overcome the previous limitation. This seems to perform well, as the doors IPC is used only to make a fast context switch and to return status.
  4. Since Speed Implementation III uses system semaphores to signal the other process, this might be heavyweight and might incur scheduling time, which could be alleviated by running the server under a processor set.
  5. The client-server application without the Speed library interposed seems to perform very badly when the message size is small, but performance improves considerably with increased message size.
  6. The Speed design uses slots to copy the application data. This could result in fragmentation if an application tries to read or write data sizes greater than the slot size. Therefore, with data sizes larger than the slot size, the TCP socket might perform better than the Speed library implementation. This was not tested or measured.
  7. The current Speed implementation has a limitation of two windows for read operations and three windows for write operations. Better performance may result if this is made configurable and increased.
  8. The number of slots is limited to 152. This limitation arises from using a limited amount of memory, one page, for buffer management. Increasing the memory size removes the limitation and may yield better numbers.
  9. The bcopy time increases with an increase in data size and approaches user time. The time required to bcopy 1000-byte messages 100,000 times is almost 1.9 seconds because there are four bcopy operations in a full duplex operation, which is a client-write to the server and a client-read from the server. This is about 35% of the elapsed time in Speed Implementation III. Most of the remaining 65% is a constant overhead of setting up the memory mapped space, connections, and so forth.
  10. The server-side latency for Speed Implementation II appears to be lower. However, this does not include the time spent in the door_service routine. Including this time should bring the figure closer to that seen for Speed Implementation III.
  11. Since the design interposes on the TCP socket library, client-server applications written in a variety of languages, such as C, C++, or Java, should be able to use it seamlessly.
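
To make the fragmentation point in observation 6 concrete: in the write loop of Code Example 14, each pass fills at most one slot and performs one semaphore wait and one semaphore post, so a single write call needs as many handoffs as there are slot-sized pieces. The helper below, whose name is invented for this illustration, is a minimal sketch of that count.

```c
#include <assert.h>
#include <stddef.h>

/* Number of slot-sized pieces needed to carry nbyte bytes (rounds up). */
size_t slot_fragments(size_t nbyte, size_t slot_size)
{
    assert(slot_size > 0);
    return (nbyte + slot_size - 1) / slot_size;
}
```

With the 1024-byte slots used in the measurements, the 70-, 512-, and 1000-byte messages each fit in one slot, whereas a 7000-byte write would be split into seven pieces.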



For Further Research

  1. With the Speed library interposed, the bcopy time approaches user-time as message sizes increase. The threshold when this happens needs to be measured.
  2. TCP sockets perform better with an increase in message sizes. With a multithreaded client configuration (five threads), the performance with large message sizes (7000 bytes per read and write) approaches the performance with the Speed library interposed. This is due to the system being able to schedule the requests concurrently and squeeze the wait-time. But at some threshold, the system will become a bottleneck, as it may not be able to squeeze more wait-time. This threshold needs to be studied.
  3. With the Speed library interposed, the application becomes compute bound instead of being I/O bound. For sustained sessions, the time-quanta available gets reduced with the Solaris time-share class. This can be alleviated by running with a modified priority or by modifying the dispatch table. This needs to be studied further.
  4. Currently, the Speed library interposition works as an IPC mechanism. For connections across the network, the regular TCP/IP sockets are used. The same concept can be extended and made to work across a network by interposing the TCP/IP kernel module. With this extension, incoming and outgoing messages can be transferred directly between the driver and the application.
  5. Providing an API to copy data to and from the mmapped space should cut bcopy times by half. This needs to be explored.
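
Item 5 can be pictured with a small sketch. Everything below is hypothetical: the Speed library exposes no such interface, the names slot_reserve and slot_commit are invented here, and an ordinary struct stands in for a memory mapped slot. The point is only to contrast the copying path, where the application's buffer is copied into the slot, with a direct path, where the application builds its message in the slot itself and one copy disappears.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 1024

/* Stand-in for one slot; in the library this would sit in the shared mapping. */
struct slot {
    size_t len;
    char   data[SLOT_SIZE];
};

/* Copying path, as with write(): the caller's buffer is copied into the slot. */
size_t slot_write_copy(struct slot *s, const void *buf, size_t n)
{
    assert(s != NULL && buf != NULL);
    if (n > SLOT_SIZE)
        n = SLOT_SIZE;
    memcpy(s->data, buf, n);
    s->len = n;
    return n;
}

/*
 * Direct path (hypothetical): return a pointer into the slot so the
 * caller can build its message in place. Returns NULL if n does not fit.
 */
char *slot_reserve(struct slot *s, size_t n)
{
    assert(s != NULL);
    return (n <= SLOT_SIZE) ? s->data : NULL;
}

/* Publish n bytes that the caller has already written into the slot. */
void slot_commit(struct slot *s, size_t n)
{
    assert(s != NULL && n <= SLOT_SIZE);
    s->len = n;
}
```

A real interface would also have to cover the semaphore or door signaling and messages larger than one slot, which this sketch leaves out.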



Conclusion

Interposing the TCP socket library with the Speed library boosts client-server performance by more than 100%. Speed Implementations II and III offer significant benefits over TCP/IP for interprocess communication. TCP/IP does not perform well with small message sizes, but its performance improves as message sizes increase. Speed Implementation III outperforms the other implementations, including TCP/IP, on a per-CPU basis. In fact, the user time approaches bcopy time with an increase in message size. At the moment, four bcopies are needed for a successful read and write operation, as data needs to be copied from the application buffer to the Speed memory mapped buffer and back. This can be reduced by half if an API is exposed, allowing the client and server applications to write directly to the memory mapped space instead of using read and write calls. This could offer a further boost to performance for bigger message sizes.



Download

You can download the source and test data.



Acknowledgements

We would like to thank Bob Palowoda for his expert advice and Ezhilan Narasimhan for his work on the DoorLet, which initiated the idea for this project. We would also like to thank Rupa Nagendra for helping with the tables and formatting of this document.



References

  1. Spring Nucleus, A Microkernel for Objects, Graham Hamilton, Panos Kougiouris, 1993
  2. SpringOS Doors in Solaris, Jim Voll
  3. TCP Network Programming (Volumes 1 and 2), Richard Stevens
  4. Solaris Internals, Richard McDougall, Jim Mauro
  5. Inside Solaris Columns, www.sunworld.com, Jim Mauro
  6. Multithreaded Programming with Pthreads, Lewis and Berg. Programs used with permission.
  7. http://docs.sun.com
