SMP-Based, Multi-Threaded Server Architecture for Softswitch/Call Processing Domain
SMP-Based, Multi-Threaded Server Architecture for Softswitch/Call Processing Domain   By Nagendra Nagarajayya, January 2001  

SMP-based systems provide a high degree of concurrency and parallelism that multi-threaded applications can use to scale linearly, and thus deliver high throughput. This article proposes a high-performance server architecture that delivers high throughput by minimizing wait latencies in processing.

The proposed server architecture categorizes the server side of the client-server paradigm into preprocessing, processing, and postprocessing stages.

The client response time is the sum of the time spent in each stage (preprocessing, processing, postprocessing) plus the network latency. Since response time is inversely proportional to throughput, minimizing response time leads to a high throughput. The time spent in each stage is further broken down into two categories: service time and wait time. Service time is the compute time required to process a request, and wait time is the time spent waiting for resources such as locks and queue slots. Service time is dependent on the available computing power, while wait time is dependent on the locking needed to achieve the desired degree of concurrency and parallelism. Ideally, service time should decrease with an increase in computing power; however, service time can be severely limited by wait time, which is architecture dependent.

The stages should be loosely coupled with minimal dependencies, as they can be executed in parallel. Critical sections within these stages should be eliminated by use of counters that are incremented atomically. Eliminating the critical sections reduces the wait latency leading to a higher throughput.

A good application for the proposed architecture is in the telecommunication domain. Telecommunication applications, such as softswitches and call-processing applications, are high-performance, highly available, client-server applications. The client-server communication is usually UDP-based messages. With this type of application, the server needs to handle burst loads during peak calling hours and sustain heavy loads for long sessions. Therefore, the underlying architecture needs to scale appropriately to handle peak loads.

This article examines the wait-latency of the various stages, as well as the scalability of the architecture with different multi-threaded and processor configurations.

Back to Top


The Server Architecture

Because UDP is a connectionless protocol that does not guarantee message delivery, every UDP-based message that is sent must be received to achieve maximum throughput. To ensure that all messages are received, the message-receiving portion of the server application code must have faster performance than other parts of the server. This section of the server needs to queue incoming messages so as to not block while a message is being processed. Once a message has been processed, the message must be queued again, freeing the processing code while the response is sent back. Therefore, to achieve optimum performance, a three-stage, loosely coupled processing is needed. The stages should not have any dependency on each other except to get and put messages. Each stage can be executed concurrently or in parallel, with in-process (one Solaris process) processing or out-of-process (more than one Solaris process) processing. The communication between the stages should be minimal and limited to notification of a queued message.

To implement this design, the prototype server is divided into preprocessing, processing, and postprocessing stages. An in-process pool of threads is used to execute each stage. Although the architecture is designed to handle out-of-process processing, this type of processing is not discussed in this article.

Queues

Two message queues named msgQ are used to store incoming and outgoing messages. Each queue has fixed-size slots. The slot size and the queue size are configurable. The queues are allocated at startup and memory mapped using the mmap() function with the MAP_SHARED flag set for in-process or out-of-process availability. Data structures for queue management and other shared resources are created in the mapped space. A queue slot is obtained by atomically incrementing an integer slot index msgqIDX. The UDP message is read directly from the socket into this slot. The message queues are shown in Figure 1.

fig. 1 Figure 1. Message Queues

Each message queue is associated with an index queue iQ, which is an array of integers. The slot index msgqIDX is stored in this queue. A queue slot is obtained by atomically incrementing an integer slot index iqIDX. Figure 2 illustrates the index queue.

fig.2 Figure 2. Index Queue

Since the message queue is shared between stages, a producer-consumer paradigm is used to store and consume messages. Counting semaphores, semaP (producer) and semaC (consumer), regulate access to the queue. The semaphore semaP is initialized to the available slots, while the semaphore semaC is initialized to 0.

A binary semaphore semaB is used to regulate access to the index queue iQ. This is the only place where access is serialized, and the processing of the stage adapts to using it based on the active thread count.

Back to Top


Thread Pools

Each stage has an associated thread pool. The processing stage has multiple thread pools if processing is asynchronous.

1. Preprocessing Thread Pool

The preprocessing thread pool, recvPOOL, is configurable and can be run in a real-time (RT) Solaris scheduling class or a timesharing (TS) scheduling class. Since the preprocessing stage needs to run faster than the other stages, a single thread in the real-time scheduling class is used. Use of a single thread in the RT class results in the best performance because the binary semaphore locking is eliminated.

2. Processing Thread Pool

The processing thread pool is made up of the servicePOOL, workerPOOL, and demandPOOL threads. Only the servicePOOL threads are used for synchronous processing, while all three thread pools are used for asynchronous processing.

The thread pools are configurable at startup. The demandPOOL is enabled only if the workerPOOL is enabled. The threads in the demandPOOL are transient; in other words, they are created with every processing request and destroyed at the end of the request. These threads are created only if the threads in the workerPOOL are busy.

The thread pools can be run either in a real-time or timesharing scheduling class.

3. Postprocessing Thread Pool

The postprocessing thread pool, respPOOL, is configurable and can be run in either the RT or TS scheduling class. Testing revealed that the postprocessing stage has the best performance because it has the least latency.

Preprocessing Stage

The preprocessing stage stores incoming messages. In this stage, a sema_wait() is performed on semaP to gain access to the message queue. If successful, a slot pointer is obtained by atomically incrementing msgqIDX. This pointer is supplied to recvfrom() to receive a UDP message. Once a message is received, access to the index queue iQ is obtained (a sema_wait() on the semaB semaphore may be performed in case of multiple threads). The integer iqIDX is atomically incremented to obtain an index slot and the msgqIDX is stored. A sema_post() operation is performed on the semaC semaphore to communicate a message arrival. This should unblock a servicePOOL thread.

Figure 3 illustrates the flow of execution in the preprocessing stage.

fig3 Figure 3. Preprocessing Stage
(click to enlarge image)

Back to Top


Processing Stage

In the processing stage, the servicePOOL threads block on the semaC semaphore waiting for a message from the preprocessing message queue. On a wake-up call, the thread obtains the message slot by atomically incrementing the processing stage index slot integer iqIDX. The thread indexes into the iQ index queue using the value of iqIDX and retrieves the msgQ slot number. It copies the message to local storage and sets the slot to null, indicating message delivery.

Once a message is retrieved, it is processed using a lookup function. The lookup function does a fast lookup using a simple in-memory database. The processed message is then stored in the postprocessing message queue by the update_response() function.

The update_response() function is similar in behavior to the preprocessing stage except that the message comes from the lookup function instead of from the UDP stream. The update-response() function executes all the steps described in the preprocessing stage: it gets access to the postprocessing message queue by calling sema_wait() on the semaP semaphore; it obtains a slot pointer by incrementing msgqIDX and copies the processed message to this slot; it gets an index slot by incrementing iqIDX and stores the msgQ slot number in it. It then unblocks a postprocessing thread by calling sema_post() on the semaC semaphore. See Figure 8 for an illustration of update_response().

1. Synchronous Processing

Processing is synchronous if the servicePOOL threads are active and the workerPOOL is disabled. In synchronous processing, all the functionality described above is executed in the servicePOOL thread. Figure 4 shows the flow of execution for synchronous processing.

fig4 Figure 4. Synchronous Processing
(click to enlarge image)

2. Asynchronous Processing

Processing is asynchronous if workerPOOL threads are active. In asynchronous processing, when a message is retrieved, the servicePOOL thread passes the message to an available workerPOOL thread. If the workerPOOL threads are busy, and if the demandPOOL threads are activated, the retrieved message is passed on to a transient demand thread for further processing. While the preprocessing stage behaves like a client of the processing stage by queuing incoming messages, the processing stage behaves like a client to the postprocessing stage by queuing out-going messages.

Figures 5, 6, and 7 illustrate the flow of asynchronous processing in the servicePOOL, workerPOOL, and demandPOOL threads.

fig5 Figure 5. Asynchronous Processing in the servicePOOL Thread
(click to enlarge image)

fig6 Figure 6. Asynchronous Processing in the workerPOOL Thread
(click to enlarge image)


fig.7 Figure 7. Asynchronous Processing in the demandPOOL Thread


Figure 8 illustrates the execution flow of the update_response() function for synchronous and asychronous processing.

fig8 Figure 8. update_response Function
(click to enlarge image)


Postprocessing

In the postprocessing stage, the respPOOL threads wait for outgoing messages by blocking on the semaC semaphore of the postprocessing message queue. On a sema_post() call from the processing stage, the respPOOL thread retrieves the outgoing message by automically incrementing the postprocessing index queue integer iqIDX. It then indexes into the iQ queue to retrieve the msgQ slot number, copies the message from the message queue, and sets the msgQ slot to null, indicating message delivery. The message is then sent on a UDP stream. This processing is illustrated in Figure 9.

fig9 Figure 9. Postprocessing Stage
(click to enlarge image)


Back to Top


Server Configuration

Environment variables control the configuration of the server environment. In the current implementation, all the variables are static and changes are read at start time, but the design is flexible and can be made dynamic. The most important environment variables are shown in the table below.

Environment variable Description Default value CPSS_RECV_THREADS Preprocessing thread pool 1 CPSS_RESP_THREADS Postprocessing thread pool 1 CPSS_SERVICE_THREADS Processing, service thread pool 1 CPSS_WORKER_THREADS Processing, worker thread pool 0 CPSS_ONDEMAND_THREADS Processing, demand thread pool unset CPSS_RECV_THREADS_RT Run preprocessing thread pool in RT class unset CPSS_RESP_THREADS_RT Run postprocessing thread pool in RT class unset CPSS_SYNCH_PROCESS Threads created with system or process scope, default system scope. Synchronization variables created with process scope unset CPSS_QSLOTS Size of the message queue 8192 CPSS_MSGSIZE Size of each msgQ slot or window 1024 CPSS_RECV_STATS Preprocessing stage statistics, number of messages received, written into mmap space. Use getstat or another program to monitor statistics. unset CPSS_RESP_STATS Postprocessing stage statistics, number of messages sent out, written into mmap space. Use getstat or another program to monitor statistics. unset

Server Statistics

Server performance statistics can be obtained by setting the environment variables CPSS_RECV_STATS and CPSS_RESP_STATS. Setting these variables turns on the counters to track incoming and outgoing messages. These variables are in mmap space so that an external program can be used to monitor the statistics. For example, getstat is a simple utility to retrieve these statistics.

Server Performance Measurements

Test Environment

To test the implementation, two simple client applications were written, one to load the server with UDP messages, and the other to verify the responses coming back from the server. The prototype server was run on a Sun E450 with four CPUs (400mhz) running Solaris 8 with two gigabytes of memory. The load client was run on an ix86-XEON (xeonL) with two CPUs (500mhz) running Solaris 8 with 512 megabytes of memory. The response client was run on an ix86-XEON (xeonR) with four CPUs (400mhz) and one gigabyte of memory.

Test Network

The E450 had a 100 Mbit/sec hme0 public network interface and two 100 Mbit/sec qfe private interfaces. The xeonL was connected to one of the private interfaces with a cross-over cable to avoid any switch or hub latencies. The xeonR was connected to the other private interface across a 100 Mbit/sec switch. Figure 10 illustrates the test network.

fig10 Figure 10. Test Network
(click to enlarge image)


Load Generation

To ensure that enough load was generated, a processor set was created on the xeonL and two instances of the client were run in parallel. The xeonL was able to send 2,020,222 messages in about 49 seconds generating about 97-98 Mbit/sec of traffic.

Compiler Settings

The server code was compiled with the Sun WorkShop 5.0 C compiler set to 04 optimization with the -fast option turned on. The client code was compiled with gcc set to the -03 optimization option.

TNF Instrumentation

To get accurate performance measurements, the server code was instrumented with TNF instrumentation. Turning on the TNF_DEBUG stubs provided more information about wait latencies.

Latency Test Run

To capture wait latencies, such as system calls, semaphore waits, mutex_lock and unlock, and queue slot retrieval, the application was first run with the prex command in kernel mode with thread and system call-tracing enabled. The pfilter option was used to filter server-related system events. Then prex was run again to collect the application probe timings.

Each test was run for a 30 second interval and a kernel-application TNF snapshot was taken. Since neither ktrace or tnftrace provide the option to take application as well as system snapshots, the supplied ktrace script was modified to do this. tnfdump was used to convert the binary TNF snapshots into text tables. A custom Perl script, tnfex.pl, calculated the different latencies from this table (similar to tnfview without the GUI). This latency table was converted into a flat file database using a makelatdb.pl script, and analyzed using a latdb_analyze.pl script to calculate the percentages and tabulate the relationships between the different latencies. This collection process was automated using shell scripts, .rhosts, and rsh to automatically run and collect the TNF dumps for various processor and thread configurations.

Scaling Test Run

The automated process was again used to collect the various data but without the TNF traces. A consistent set of timings was collected for different processor and thread configurations.

Scaling Tests

To test the linear scaling, the server was run with configurations of one CPU, two CPUs, four CPUs, and a processor set with three CPUs (interrupts disabled for the set). For each processor configuration, the server was started with a different thread configuration, such as one recvPOOL thread, one servicePOOL thread, and one respPOOL thread, with the threads running in process or system scope in TS or RT class. The tests were run for 120 seconds each, and at the end the getstats utility was used to obtain the server statistics. The server was restarted for each run.

Test -Client load - 2020222 mgs, Server msgQ size - 2020222

i. Client Statistics

Total msgs sent Msg Size Avg time to send msgs in secs Network packet size Network utilization Pkts/Sec 2020222 256 49 298 98.28998 ~41200

Back to Top


ii. Server Statistics

The top three configurations with the recvPOOL thread running in RT class and without RT are shown below. The remaining configurations are shown in Appendix B.

E450 with One CPU (Three CPUs offline)

The test configuration key used in the tables below is 0|1|process|r1r1s1w1|rcv_rt, where:

0 1 process r1r1s1w1 rcv_rt Processor set CPUs online Thread scope recvPOOL threads respPOOL threads servPOOL threads workerPOOL threads recv POOL threads in RT class 0-No processor set
1-processor set # # of CPUS online system - System scope
process - Process scope r1 - 1 recvPOOL thread r1 - 1 respPOOL thread s1 - 1 servPOOL thread w1 - 1 workerPOOL thread rcv_rt - in RT class
none - in TS class

test # test configuration Rcv msgs snd msgs total msgs %rcv %snd % throughput 1 0|1|process|r1r1s1w1|rcv_rt 457809 457809 915618 22.66132 22.66132 22.66132 2 0|1|process|r1r1s1w1|rcv_rt 457809 457809 915618 22.66132 22.66132 22.66132 3 0|1|process|r2r2s4w4|rcv_rt 453563 453563 907126 22.45115 22.45115 22.45115 4 0|1|process|r1r1s1|none 159822 159822 319644 7.911111 7.911111 7.911111 5 0|1|process|r1r1s1|none 159822 159822 319644 7.911111 7.911111 7.911111 6 0|1|process|r2r2s2w2|none 148208 148208 296416 7.336223 7.336223 7.336223

graph 1
(click to enlarge image)

Back to Top


E450 with Two CPUs (Two CPUs offline)

test # test configuration Rcv msgs snd msgs total msgs %rcv %snd % throughput 1 0|2|sys|r1r1s1|rcv_rt 1891132 1444167 3335299 93.61011 71.48556 82.5478 2 0|2|process|r1r1s1w1|rcv_rt 1831189 896594 2727783 90.64296 44.38096 67.512 3 0|2|process|r1r1s1w2|rcv_rt 1826721 1176023 3002744 90.4218 58.21256 74.3172 4 0|2|sys|r2r2s2w2|none 938554 938554 1877108 46.45796 46.45796 46.458 5 0|2|sys|r1r1s1w1|none 860527 860527 1721054 42.59567 42.59567 42.5957 6 0|2|sys|r2r2s2|none 856792 856792 1713584 42.41078 42.41078 42.4108

graph 2
(click to enlarge image)

Back to Top


E450 with Four CPUs

test # test configuration Rcv msgs snd msgs total msgs %rcv %snd % throughput 1 0|4|process|r1r1s1|rcv_rt 2020195 2020195 4040390 99.99866 99.99866 99.99866 2 0|4|process|r1r1s1w1|rcv_rt 2019963 1034030 3053993 99.98718 51.18398 75.58558 3 0|4|process|r1r1s1w2|rcv_rt 2011946 1599451 3611397 99.59034 79.17204 89.38119 4 0|4|process|r1r1s1w2|none 2005388 1587301 3592689 99.26572 78.57062 88.91817 5 0|4|process|r1r1s1|none 1975966 1975966 3951932 97.80935 97.80935 97.80935

graph 3
(click to enlarge image)

Back to Top


Processor Set with Three CPUs (CPUs 1,2,3), interrupts disabled, CPU 0 handles interrupts

test # test configuration Rcv msgs snd msgs total msgs %rcv %snd % throughput 1 1|3|process|r1r1s1|rcv_rt 2020209 2020209 4040418 99.99936 99.99936 99.99936 2 1|3|process|r1r1s1w1|rcv_rt 2020198 1175111 3195309 99.99881 58.16742 79.08312 3 1|3|process|r1r1s1w2|rcv_rt 1984786 1746591 3731377 98.24594 86.4554 92.35067 4 1|3|process|r1r1s1|none 1806390 1806390 3612780 89.41542 89.41542 89.41542 5 1|3|sys|r1r1s1w1|none 1747759 1747759 3495518 86.51321 86.51321 86.51321 6 1|3|sys|r1r1s1|none 1722285 1722285 3444570 85.25226 85.25226 85.25226

graph 4
(click to enlarge image)

Back to Top


Linear Scaling with recvPOOL Threads in RT Class

The table below represents the best performing configurations with RT under different CPUs from the preceding tests. The results have been graphed for scalability.

test # test configuration Rcv msgs snd msgs total client msgs throughput 1 0|1|process|r1r1s1w1|rcv_rt 457809 457809 2020222 457809 2 0|2|sys|r1r1s1|rcv_rt 1891132 1444167 2020222 1667650 3 0|4|process|r1r1s1|rcv_rt 2020195 2020195 2020222 2020195 4 1|3|process|r1r1s1|rcv_rt 2020209 2020209 2020222 2020209

graph 5
(click to enlarge image)

Back to Top


Linear scaling with threads in TS class

The table below represents the best performing configurations with no RT under different CPUs from the preceding tests, and has been graphed for scalability.

test # test configuration Rcv msgs snd msgs total client msgs throughput 1 0|1|process|r1r1s1|none 159822 159822 2020222 159822 2 0|2|sys|r2r2s2w2|none 938554 938554 2020222 938554 3 1|3|process|r1r1s1|none 1806390 1806390 2020222 1806390 4 0|4|process|r1r1s1|none 1975966 1975966 2020222 1975966

graph 6
(click to enlarge image)

Back to Top


Latency Analysis

The server was run with different thread and processor configurations under prex for 30 seconds, and a kernel and application TNF snapshot was taken. This snapshot was then tabulated and analyzed for various latencies. The best performance numbers for each test were then interpolated into this tabulated data and were sorted by throughput (descending), total latency (ascending), and test configuration to determine whether the best performing configurations had very low latency. The best ten of these configurations are shown below, and the remainder are available in Appendix A. The highest performing configurations shown in the scaling tables above are reflected in the top ten configurations.

Column Key:

Syscall_count - Number of syscalls
Syscall_lat - Average latency of each call
Sema_wait_lat - Average sema_wait( operation
Total_latency - Wait latencies plus service time

graph7table
(click to enlarge image)

graph 7
(click to enlarge image)

Back to Top


Observations
  1. The best performance results are found in configurations with low latency. When the TNF data is sorted by total delay latency (sum of wait latencies), a few other configurations show up at the top, but the throughput is not the highest and the syscall count is on the higher side.
  2. As shown in Appendix A, the TNF table, the processing stage wait latency is high for configuration with one worker thread and decreases linearly with an increase in the number of worker threads. Configuration with one servicePOOL and one workerPOOL thread are tightly coupled. In fact, good throughput is seen with one servicePOOL and four workerPOOL threads. Analyzing further, the current implementation actually functions a bit like the scheduler in that it tries to wake up one available worker thread. It seemed that if the worker threads were busy, the transient demandPOOL threads could lower this latency. However, an unrestricted demandPOOL can easily overwhelm system resources by starting -n number of transient threads. This implementation could be changed by adding a worker stage with a worker message queue to remove the tight coupling, which might improve the performance.
  3. Use of the atomic library reduces the wait latency, as it is extremely fast. The wait latency to obtain a slot is very low.
  4. The system seems to perform well with processor sets. A bigger system with more CPUs in the set could yield very good numbers.
  5. With system-scope threads, the synchronization primitives are initialized for system scope. Better numbers might be realized with process-scope primitives if all the stages are run in-process.
  6. The server seems to perform well with system-scope threads for two-CPU configurations but not as well with more CPUs. This could not be analyzed as the required data was not captured.
  7. The RT class should not be used for two-CPU and one-CPU configurations as it might crash the whole system due to priority-inversion problems.
  8. The atomic library supports incrementing signed integers, so the counters will become negative at some point. Therefore, an unsigned version of this library is needed.
  9. To maintain the high-throughput at this peak load for sustained sessions, the server process priority needs to be modified as the time-quanta is adjusted based on CPU usage under the TS class. This needs more investigation.
  10. The binary semaphore could be replaced with a mutex.
  11. The performance measurements were done with no operating system or network tuning. The netstat set to -k qfe2 shows nocanput count to be on the higher side. So setting sq_max_size to a higher value like 100+ could improve performance.
  12. Compiling the code with the new WorkShop 6.1 compiler with different optimization switches might also yield better throughput.

Back to Top


Conclusion

A high throughput can be achieved by reducing wait-latencies and introducing queues to decouple dependencies. Even though the current data reveals that low latency improves performance, a more detailed and exhaustive analysis may be needed to accurately predict a sustainable high-performance under different thread-processor configurations.

Appendixes

See Appendix A (part1, part2) for latency analysis data, and Appendix B for performance data. You can also download the source, test scripts and analysis data.

Acknowledgements

I would like to thank my colleagues, S.R. Venkatraman and Bob Palowoda, for helping me with accesss to resources and providing expert advice on issues related to this project.

References
  1. Voice over IP Fundamentals, Jonathan Davidson, James Peters
  2. Speed Library, Nagendra Nagarajayya, S.R. Venkatramanan
  3. Inside Solaris columns, www.sunworld.com , James Mauro
  4. Solaris Internals, Richard McDougall, James Mauro
  5. Solaris documents, docs.sun.com
  6. The atomics library, www.netwiz.net/~mbennett/atomics, Mike Bennet
  7. Multi-threading programming with PTHREADS, Bill Lewis and Daniel J. Berge

January 2001

Rate and Review Tell us what you think of the content of this page. Excellent   Good   Fair   Poor   Comments:
If you would like a reply to your comment, please submit your email address:
Note: We may not respond to all submitted comments.
Close    To Top
  • Prev Article-OS:
  • Next Article-OS: None
  • Now: Tutorial for Web and Software Design > OS > Solaris > OS Content
    Photoshop Tutorial
     

    Special Effect

      3D Effect
      Photoshop Articles
    Programming Tutorial
     

    C/C++ Tutorial

      Visual Basic
      C# Tutorial
    Database Tutorial
     

    MySQL Tutorial

      MS SQL Tutorial
      Oracle Tutorial
    Geek Tutorial
     

    Blogging Tutorial

      RSS Tutorial
      Podcasting Tutorial
    Graphic Design Tutorial
      Coreldraw Tutorial
      Illustrator Tutorial
      3D Tutorials
    Webmaster Articles
     

    Domain Service

      Web Hosting
      Site Promotion
    Java Tutorial/ Articles
     

    Java Servlets

      JavaEE Tutorial
     

    JavaBeans Tutorial

    XML Tutorial/ Articles
     

    XML Style

      AJAX Tutorial
      XML Mobile
    Flash Tutorial/ Articles
     

    Flash Video

      Action Script
      Flash Articles
    OS Tutorial/ Articles
      Linux Tutorial
      Symbian Tutorial
      MacOS Tutorial
    Personal Tech
      Hardware Tutorial
      Software Tutorial
      Online Auction