SMP-based systems provide a high degree of concurrency and parallelism that multi-threaded applications can use to scale linearly, and thus deliver high throughput. This article proposes a high-performance server architecture that delivers high throughput by minimizing wait latencies in processing.
The proposed server architecture categorizes the server side of the client-server paradigm into preprocessing, processing, and postprocessing stages.
The client response time is the sum of the time spent in each stage (preprocessing, processing, postprocessing) plus the network latency. Since response time is inversely proportional to throughput, minimizing response time leads to a high throughput. The time spent in each stage is further broken down into two categories: service time and wait time. Service time is the compute time required to process a request, and wait time is the time spent waiting for resources such as locks and queue slots. Service time is dependent on the available computing power, while wait time is dependent on the locking needed to achieve the desired degree of concurrency and parallelism. Ideally, service time should decrease with an increase in computing power; however, service time can be severely limited by wait time, which is architecture dependent.
The stages should be loosely coupled with minimal dependencies, as they can be executed in parallel. Critical sections within these stages should be eliminated by use of counters that are incremented atomically. Eliminating the critical sections reduces the wait latency leading to a higher throughput.
A good application for the proposed architecture is in the telecommunication domain. Telecommunication applications, such as softswitches and call-processing applications, are high-performance, highly available, client-server applications. The client-server communication is usually UDP-based messages. With this type of application, the server needs to handle burst loads during peak calling hours and sustain heavy loads for long sessions. Therefore, the underlying architecture needs to scale appropriately to handle peak loads.
This article examines the wait-latency of the various stages, as well as the scalability of the architecture with different multi-threaded and processor configurations.
Because UDP is a connectionless protocol that does not guarantee message delivery, every UDP-based message that is sent must be received to achieve maximum throughput. To ensure that all messages are received, the message-receiving portion of the server application code must have faster performance than other parts of the server. This section of the server needs to queue incoming messages so as to not block while a message is being processed. Once a message has been processed, the message must be queued again, freeing the processing code while the response is sent back. Therefore, to achieve optimum performance, a three-stage, loosely coupled processing is needed. The stages should not have any dependency on each other except to get and put messages. Each stage can be executed concurrently or in parallel, with in-process (one Solaris process) processing or out-of-process (more than one Solaris process) processing. The communication between the stages should be minimal and limited to notification of a queued message.
To implement this design, the prototype server is divided into preprocessing, processing, and postprocessing stages. An in-process pool of threads is used to execute each stage. Although the architecture is designed to handle out-of-process processing, this type of processing is not discussed in this article.
Since the message queue is shared between stages, a producer-consumer paradigm is used to store and consume messages. Counting semaphores, semaP (producer) and semaC (consumer), regulate access to the queue. The semaphore semaP is initialized to the available slots, while the semaphore semaC is initialized to 0.
Each stage has an associated thread pool. The processing stage has multiple thread pools if processing is asynchronous.
The thread pools are configurable at startup. The demandPOOL is enabled only if the workerPOOL is enabled. The threads in the demandPOOL are transient; in other words, they are created with every processing request and destroyed at the end of the request. These threads are created only if the threads in the workerPOOL are busy.
The thread pools can be run either in a real-time or timesharing scheduling class.
The preprocessing stage stores incoming messages. In this stage, a sema_wait() is performed on semaP to gain access to the message queue. If successful, a slot pointer is obtained by atomically incrementing msgqIDX. This pointer is supplied to recvfrom() to receive a UDP message. Once a message is received, access to the index queue iQ is obtained (a sema_wait() on the semaB semaphore may be performed in case of multiple threads). The integer iqIDX is atomically incremented to obtain an index slot and the msgqIDX is stored. A sema_post() operation is performed on the semaC semaphore to communicate a message arrival. This should unblock a servicePOOL thread.
Once a message is retrieved, it is processed using a lookup function. The lookup function does a fast lookup using a simple in-memory database. The processed message is then stored in the postprocessing message queue by the update_response() function.
Figure 5. Asynchronous Processing in the servicePOOL Thread
(click to enlarge image)
Figure 6. Asynchronous Processing in the workerPOOL Thread
(click to enlarge image)
Figure 7. Asynchronous Processing in the demandPOOL Thread
Figure 8 illustrates the execution flow of the update_response() function for synchronous and asychronous processing.
Figure 8. update_response Function
(click to enlarge image)
Postprocessing
In the postprocessing stage, the respPOOL threads wait for outgoing messages by blocking on the semaC semaphore of the postprocessing message queue. On a sema_post() call from the processing stage, the respPOOL thread retrieves the outgoing message by automically incrementing the postprocessing index queue integer iqIDX. It then indexes into the iQ queue to retrieve the msgQ slot number, copies the message from the message queue, and sets the msgQ slot to null, indicating message delivery. The message is then sent on a UDP stream. This processing is illustrated in Figure 9.
Figure 9. Postprocessing Stage
(click to enlarge image)
Back to Top
Server Configuration
Environment variables control the configuration of the server environment. In the current implementation, all the variables are static and changes are read at start time, but the design is flexible and can be made dynamic. The most important environment variables are shown in the table below.
Environment variable
Description
Default value
CPSS_RECV_THREADS
Preprocessing thread pool
1
CPSS_RESP_THREADS
Postprocessing thread pool
1
CPSS_SERVICE_THREADS
Processing, service thread pool
1
CPSS_WORKER_THREADS
Processing, worker thread pool
0
CPSS_ONDEMAND_THREADS
Processing, demand thread pool
unset
CPSS_RECV_THREADS_RT
Run preprocessing thread pool in RT class
unset
CPSS_RESP_THREADS_RT
Run postprocessing thread pool in RT class
unset
CPSS_SYNCH_PROCESS
Threads created with system or process scope, default system scope. Synchronization variables created with process scope
unset
CPSS_QSLOTS
Size of the message queue
8192
CPSS_MSGSIZE
Size of each msgQ slot or window
1024
CPSS_RECV_STATS
Preprocessing stage statistics, number of messages received, written into mmap space. Use getstat or another program to monitor statistics.
unset
CPSS_RESP_STATS
Postprocessing stage statistics, number of messages sent out, written into mmap space. Use getstat or another program to monitor statistics.
unset
Server Statistics
Server performance statistics can be obtained by setting the environment variables CPSS_RECV_STATS and CPSS_RESP_STATS. Setting these variables turns on the counters to track incoming and outgoing messages. These variables are in mmap space so that an external program can be used to monitor the statistics. For example, getstat is a simple utility to retrieve these statistics.
Server Performance Measurements
Test Environment
To test the implementation, two simple client applications were written, one to load the server with UDP messages, and the other to verify the responses coming back from the server. The prototype server was run on a Sun E450 with four CPUs (400mhz) running Solaris 8 with two gigabytes of memory. The load client was run on an ix86-XEON (xeonL) with two CPUs (500mhz) running Solaris 8 with 512 megabytes of memory. The response client was run on an ix86-XEON (xeonR) with four CPUs (400mhz) and one gigabyte of memory.
Test Network
The E450 had a 100 Mbit/sec hme0 public network interface and two 100 Mbit/sec qfe private interfaces. The xeonL was connected to one of the private interfaces with a cross-over cable to avoid any switch or hub latencies. The xeonR was connected to the other private interface across a 100 Mbit/sec switch. Figure 10 illustrates the test network.
Figure 10. Test Network
(click to enlarge image)
Load Generation
To ensure that enough load was generated, a processor set was created on the xeonL and two instances of the client were run in parallel. The xeonL was able to send 2,020,222 messages in about 49 seconds generating about 97-98 Mbit/sec of traffic.
Compiler Settings
The server code was compiled with the Sun WorkShop 5.0 C compiler set to 04 optimization with the -fast option turned on. The client code was compiled with gcc set to the -03 optimization option.
TNF Instrumentation
To get accurate performance measurements, the server code was instrumented with TNF instrumentation. Turning on the TNF_DEBUG stubs provided more information about wait latencies.
Latency Test Run
To capture wait latencies, such as system calls, semaphore waits, mutex_lock and unlock, and queue slot retrieval, the application was first run with the prex command in kernel mode with thread and system call-tracing enabled. The pfilter option was used to filter server-related system events. Then prex was run again to collect the application probe timings.
Each test was run for a 30 second interval and a kernel-application TNF snapshot was taken. Since neither ktrace or tnftrace provide the option to take application as well as system snapshots, the supplied ktrace script was modified to do this. tnfdump was used to convert the binary TNF snapshots into text tables. A custom Perl script, tnfex.pl, calculated the different latencies from this table (similar to tnfview without the GUI). This latency table was converted into a flat file database using a makelatdb.pl script, and analyzed using a latdb_analyze.pl script to calculate the percentages and tabulate the relationships between the different latencies. This collection process was automated using shell scripts, .rhosts, and rsh to automatically run and collect the TNF dumps for various processor and thread configurations.
Scaling Test Run
The automated process was again used to collect the various data but without the TNF traces. A consistent set of timings was collected for different processor and thread configurations.
Scaling Tests
To test the linear scaling, the server was run with configurations of one CPU, two CPUs, four CPUs, and a processor set with three CPUs (interrupts disabled for the set). For each processor configuration, the server was started with a different thread configuration, such as one recvPOOL thread, one servicePOOL thread, and one respPOOL thread, with the threads running in process or system scope in TS or RT class. The tests were run for 120 seconds each, and at the end the getstats utility was used to obtain the server statistics. The server was restarted for each run.
Test -Client load - 2020222 mgs, Server msgQ size - 2020222
i. Client Statistics
Total msgs sent
Msg Size
Avg time to send msgs in secs
Network packet size
Network utilization
Pkts/Sec
2020222
256
49
298
98.28998
~41200
Back to Top
ii. Server Statistics
The top three configurations with the recvPOOL thread running in RT class and without RT are shown below. The remaining configurations are shown in Appendix B.
E450 with One CPU (Three CPUs offline)
The test configuration key used in the tables below is 0|1|process|r1r1s1w1|rcv_rt, where:
0
1
process
r1r1s1w1
rcv_rt
Processor set
CPUs online
Thread scope
recvPOOL threads
respPOOL threads
servPOOL threads
workerPOOL threads
recv POOL threads in RT class
0-No processor set
1-processor set #
# of CPUS online
system - System scope
process - Process scope
r1 - 1 recvPOOL thread
r1 - 1 respPOOL thread
s1 - 1 servPOOL thread
w1 - 1 workerPOOL thread
rcv_rt - in RT class
none - in TS class
test #
test configuration
Rcv msgs
snd msgs
total msgs
%rcv
%snd
% throughput
1
0|1|process|r1r1s1w1|rcv_rt
457809
457809
915618
22.66132
22.66132
22.66132
2
0|1|process|r1r1s1w1|rcv_rt
457809
457809
915618
22.66132
22.66132
22.66132
3
0|1|process|r2r2s4w4|rcv_rt
453563
453563
907126
22.45115
22.45115
22.45115
4
0|1|process|r1r1s1|none
159822
159822
319644
7.911111
7.911111
7.911111
5
0|1|process|r1r1s1|none
159822
159822
319644
7.911111
7.911111
7.911111
6
0|1|process|r2r2s2w2|none
148208
148208
296416
7.336223
7.336223
7.336223

(click to enlarge image)
Back to Top
E450 with Two CPUs (Two CPUs offline)
test #
test configuration
Rcv msgs
snd msgs
total msgs
%rcv
%snd
% throughput
1
0|2|sys|r1r1s1|rcv_rt
1891132
1444167
3335299
93.61011
71.48556
82.5478
2
0|2|process|r1r1s1w1|rcv_rt
1831189
896594
2727783
90.64296
44.38096
67.512
3
0|2|process|r1r1s1w2|rcv_rt
1826721
1176023
3002744
90.4218
58.21256
74.3172
4
0|2|sys|r2r2s2w2|none
938554
938554
1877108
46.45796
46.45796
46.458
5
0|2|sys|r1r1s1w1|none
860527
860527
1721054
42.59567
42.59567
42.5957
6
0|2|sys|r2r2s2|none
856792
856792
1713584
42.41078
42.41078
42.4108

(click to enlarge image)
Back to Top
E450 with Four CPUs
test #
test configuration
Rcv msgs
snd msgs
total msgs
%rcv
%snd
% throughput
1
0|4|process|r1r1s1|rcv_rt
2020195
2020195
4040390
99.99866
99.99866
99.99866
2
0|4|process|r1r1s1w1|rcv_rt
2019963
1034030
3053993
99.98718
51.18398
75.58558
3
0|4|process|r1r1s1w2|rcv_rt
2011946
1599451
3611397
99.59034
79.17204
89.38119
4
0|4|process|r1r1s1w2|none
2005388
1587301
3592689
99.26572
78.57062
88.91817
5
0|4|process|r1r1s1|none
1975966
1975966
3951932
97.80935
97.80935
97.80935

(click to enlarge image)
Back to Top
Processor Set with Three CPUs (CPUs 1,2,3), interrupts disabled, CPU 0 handles interrupts
test #
test configuration
Rcv msgs
snd msgs
total msgs
%rcv
%snd
% throughput
1
1|3|process|r1r1s1|rcv_rt
2020209
2020209
4040418
99.99936
99.99936
99.99936
2
1|3|process|r1r1s1w1|rcv_rt
2020198
1175111
3195309
99.99881
58.16742
79.08312
3
1|3|process|r1r1s1w2|rcv_rt
1984786
1746591
3731377
98.24594
86.4554
92.35067
4
1|3|process|r1r1s1|none
1806390
1806390
3612780
89.41542
89.41542
89.41542
5
1|3|sys|r1r1s1w1|none
1747759
1747759
3495518
86.51321
86.51321
86.51321
6
1|3|sys|r1r1s1|none
1722285
1722285
3444570
85.25226
85.25226
85.25226

(click to enlarge image)
Back to Top
Linear Scaling with recvPOOL Threads in RT Class
The table below represents the best performing configurations with RT under different CPUs from the preceding tests. The results have been graphed for scalability.
test #
test configuration
Rcv msgs
snd msgs
total client msgs
throughput
1
0|1|process|r1r1s1w1|rcv_rt
457809
457809
2020222
457809
2
0|2|sys|r1r1s1|rcv_rt
1891132
1444167
2020222
1667650
3
0|4|process|r1r1s1|rcv_rt
2020195
2020195
2020222
2020195
4
1|3|process|r1r1s1|rcv_rt
2020209
2020209
2020222
2020209

(click to enlarge image)
Back to Top
Linear scaling with threads in TS class
The table below represents the best performing configurations with no RT under different CPUs from the preceding tests, and has been graphed for scalability.
test #
test configuration
Rcv msgs
snd msgs
total client msgs
throughput
1
0|1|process|r1r1s1|none
159822
159822
2020222
159822
2
0|2|sys|r2r2s2w2|none
938554
938554
2020222
938554
3
1|3|process|r1r1s1|none
1806390
1806390
2020222
1806390
4
0|4|process|r1r1s1|none
1975966
1975966
2020222
1975966

(click to enlarge image)
Back to Top
Latency Analysis
The server was run with different thread and processor configurations under prex for 30 seconds, and a kernel and application TNF snapshot was taken. This snapshot was then tabulated and analyzed for various latencies. The best performance numbers for each test were then interpolated into this tabulated data and were sorted by throughput (descending), total latency (ascending), and test configuration to determine whether the best performing configurations had very low latency. The best ten of these configurations are shown below, and the remainder are available in Appendix A. The highest performing configurations shown in the scaling tables above are reflected in the top ten configurations.
Column Key:
Syscall_count - Number of syscalls
Syscall_lat - Average latency of each call
Sema_wait_lat - Average sema_wait( operation
Total_latency - Wait latencies plus service time

(click to enlarge image)

(click to enlarge image)
Back to Top
Observations
- The best performance results are found in configurations with low latency. When the TNF data is sorted by total delay latency (sum of wait latencies), a few other configurations show up at the top, but the throughput is not the highest and the syscall count is on the higher side.
- As shown in Appendix A, the TNF table, the processing stage wait latency is high for configuration with one worker thread and decreases linearly with an increase in the number of worker threads. Configuration with one
servicePOOL and one workerPOOL thread are tightly coupled. In fact, good throughput is seen with one servicePOOL and four workerPOOL threads. Analyzing further, the current implementation actually functions a bit like the scheduler in that it tries to wake up one available worker thread. It seemed that if the worker threads were busy, the transient demandPOOL threads could lower this latency. However, an unrestricted demandPOOL can easily overwhelm system resources by starting -n number of transient threads. This implementation could be changed by adding a worker stage with a worker message queue to remove the tight coupling, which might improve the performance.
- Use of the atomic library reduces the wait latency, as it is extremely fast. The wait latency to obtain a slot is very low.
- The system seems to perform well with processor sets. A bigger system with more CPUs in the set could yield very good numbers.
- With system-scope threads, the synchronization primitives are initialized for system scope. Better numbers might be realized with process-scope primitives if all the stages are run in-process.
- The server seems to perform well with system-scope threads for two-CPU configurations but not as well with more CPUs. This could not be analyzed as the required data was not captured.
- The RT class should not be used for two-CPU and one-CPU configurations as it might crash the whole system due to priority-inversion problems.
- The atomic library supports incrementing signed integers, so the counters will become negative at some point. Therefore, an unsigned version of this library is needed.
- To maintain the high-throughput at this peak load for sustained sessions, the server process priority needs to be modified as the time-quanta is adjusted based on CPU usage under the TS class. This needs more investigation.
- The binary semaphore could be replaced with a mutex.
- The performance measurements were done with no operating system or network tuning. The
netstat set to -k qfe2 shows nocanput count to be on the higher side. So setting sq_max_size to a higher value like 100+ could improve performance.
- Compiling the code with the new WorkShop 6.1 compiler with different optimization switches might also yield better throughput.
Back to Top
Conclusion
A high throughput can be achieved by reducing wait-latencies and introducing queues to decouple dependencies. Even though the current data reveals that low latency improves performance, a more detailed and exhaustive analysis may be needed to accurately predict a sustainable high-performance under different thread-processor configurations.
Appendixes
See Appendix A (part1, part2) for latency analysis data, and Appendix B for performance data. You can also download the source, test scripts and analysis data.
Acknowledgements
I would like to thank my colleagues, S.R. Venkatraman and Bob Palowoda, for helping me with accesss to resources and providing expert advice on issues related to this project.
References
- Voice over IP Fundamentals, Jonathan Davidson, James Peters
- Speed Library, Nagendra Nagarajayya, S.R. Venkatramanan
- Inside Solaris columns, www.sunworld.com , James Mauro
- Solaris Internals, Richard McDougall, James Mauro
- Solaris documents, docs.sun.com
- The atomics library, www.netwiz.net/~mbennett/atomics, Mike Bennet
- Multi-threading programming with PTHREADS, Bill Lewis and Daniel J. Berge
January 2001