Standard Library, STL and Thread Safety
Standard Library, STL and Thread Safety    

April 2000

Summary

This article introduces you to STL standard library and thread-safety issues when the standard library is used in a multi-threaded mode. It explains how a current implementation, which uses coarser granularity locks, causes performance penalty. The penalty is assessed using a simple test that creates string objects in a multi-threaded mode. This article also describes an alternative locking technique and a possible future enhancement that can improve performance and compares their relative merits to the current mutex based critical sections. Finally, it explains how to use the workshop compilers with other standard libraries that are currently available.

Introduction

The Sun WorkShop 5.0 compiler currently provides a C++ standard library in archive form (libCstd.a). You can use this library to develop threaded applications. The standard library is based on Rogue Wave standard library version 2.01.01. Rogue Wave documentation on thread safety states that:

If you use the library in a multi-threading environment (and you make use of multiple threads) then you will need to compile with _RWSTD_MULTI_THREAD defined and link to a version of the library built with the option -DTHREAD=MULTI (rwstdmt.lib). This will give you a multi-thread safe version of the library. The multi-thread safe version of the library makes sure that the internals of the library all work properly in a multi-threading environment. You will still need to lock around any library objects that you share between threads (except for iostreams and locale objects).

For example, if you instantiate a string and create a new thread and you want to pass that string to the thread by reference, you will need to lock around write access to that string because you are explicitly sharing the one string object between threads. The library provides a mutex class (mutex_lock and mutex_unlock from the thread library) to accomplish this task.

On the other hand, if you pass the string to the new thread by value, then you will not need to lock, even though the string in the two different threads may share a representation through Rogue Wave's "copy on write" technology. The library handles the locking automatically and provides thread-safety at a finer granularity level via a separate reference counting class (for example, string_ref for strings.) In other words, the reference counting is updated automatically in a thread safe manner. Hence you are only required to lock when making an object available to multiple threads explicitly, either by passing references between threads or by using global or static objects.

Back to Top


Copy on Write and Reference Counting

Classes (for example, string) use a technique called copy on write to minimize copying. This technique offers the advantage of easy-to-understand value semantics with the speed of reference counted pointer implementation.

This section illustrates how the technique works. When you initialize a String with another String via the copy constructor:

String(cont RWCString&);

The two strings share the same data until one of them tries to write to it. At that point, a copy of the data is made, and the two strings go their separate ways. Copying only at "write" time makes copies of strings, particularly read-only copies, very inexpensive.

The following example shows how four objects share one copy of a string until one of the objects attempts to change the string:

#include <string.h>

String g;                                     // Global object
void setGlobal(String x) { g = x; }
main(){
  String a("kernel");                         // 1
  String b(a);                                // 2
  String c(a);                                // 3

  setGlobal(a);    // Still only one copy of "kernel"!  //4

  b += "s";        // Now b has its own data: "kernels" //5
}

//1 Object a is the actual allocation and initialization of the memory to hold the string kernel. //2-//3 When objects b and c are created from a, they merely increment a reference count in the original data and return. At this point, there are three objects that look at the same piece of data. //4 The function setGlobal() sets the value for g, the global string to the same value. Now the reference count is up to four, and there is still only one copy of the string kernel. //5 Finally, object b tries to change the value of the string. It looks at the reference count and sees that it is greater than one, which implies that the string is being shared by more than one object. At this point, a clone of the string is made and modified. The reference count of the original string drops back down to three, and the reference count of the newly cloned string is one.

The counter for the reference increments and decrements under the protection of a lock.

Back to Top


LEVEL 2 Thread_safety

The library is safe when only pass by value is used. So if you pass std::strings by reference, you will need to lock access and enforce one-thread-at-a-time access yourself. Take the following example:

std::string str = "A string";
a()
{

   ...
   thr_create(NULL, 0, b, (LPVOID)NULL, 0,&tid);
   str = "";
}

b()
{
   std::string strcopy = str;
}

If both assignments happen at the same time, the first one can end up in the reference by going to 0, thus freeing the memory. Meanwhile, the second assignment can see the reference at 1 and decides to add a reference to a now deallocated replica. This example demonstrates that even a simple operation such as "=" should have a lock to prevent a race condition. In other words, every accessible member function should be guarded and such can result in severe performance penalty.

To access the performance implication, a level_2 thread safety model was implemented for string class. The header files and string class member were modified to accommodate one thread at a time access. Take a look at a sample implementation:

#if defined (_RWSTD_MULTI_THREAD) && defined
(_RWSTD_STRING_MT_SAFETY_LEVEL_2)
# define MT_GUARD(name) _RWSTDGuard name (this->__mutex)
#else
# define MT_GUARD(name) ((void)0)
#endif // _RWSTD_MULTI_THREAD

When level_2 thread safety is defined, Rogue Wave's STDGuard method gets invoked, which defaults to a void.

The actual implementation for the operator "=" method looks like the following:

template <class charT, class traits, class Allocator > |
basic_string<charT, traits, Allocator> &
basic_string<charT, traits, Allocator>::operator= (
  const basic_string<charT, traits, Allocator>& str)
{
 if (this != &str)
 {
  MT_GUARD(guard);
  if (str.__pref()->__references() > 0)
  {
   str.__pref()->__addReference();
   __unLink();
   __data_ = str.__data_.data();
  }
   else
    this->operator=(str.c_str());
 }
 return *this;
}

As you can see, the locking is introduced at a much coarser level that is thus serializing the entire invocation.

Back to Top


Atomic updates for reference counting

Another optimization uses atomic updates for counters instead of mutex. Level 2 type guarding at coarser grain shows that the critical section is much larger. It also shows that mutex based guard overheads are small in comparison to the amount of work done in the critical section. However, reference counting, and counter increment and decrement operations are more efficient using atomic updates because the overheads from calling mutex methods can be much higher.

This section discusses the two different optimizing techniques. One uses a more general swap based on spin blocking while the other uses compare and swap based updates. You can use spin blocking as a general replacement for mutexes. The CAS based Fetch_and_Add is more specific to counter increments and decrements and requires no guarding.

Swap based spin blocking for guard methods are shown below and these methods were implemented in regular header file in a transparent manner.

In the following example, SWAPINITLOCK, SWAPLOCK and SWAPUNLOCK macros are called in the place of pthread_mutex_init, pthread_mutex_lock and pthread_mutex_unlock:

extern "C" Test_and_Set(long *,long);
#define SWAPINITLOCK(lp)        (*(lp) = 0)
#define SWAPLOCK(lp)            while (Test_and_Set(lp, 1L));
#define SWAPUNLOCK(lp)          Test_and_Set(lp,0L)

The Test_and_Set is implemented as a inline function using swap:

.inline Test_and_Set,8
mov %o0,%o2
mov %o1,%o0
swap [%o2],%o0

.end

In CAS based updates, _RWSTD_MT_INCREMENT macro is defined to be a Fetch_and_Add inline assembler function:

.inline fetch_and_add,4
retry:
ld [%o0],%l0
add %l0,%o1,%l1
cas [%o0],%l0,%l1
cmp %l0,%l1
bne retry
mov %l1,%o0
nop

.end

Back to Top


Performance Measurement

A simple multi-threaded server was designed to test and compare performances. For each case, a new library (libCstd.a) was built and timing measurements were done under the same condition on the same system. The ultra 60 repeatedly created string objects in a mt_mode.

No. of threads No. of objects method level_2 elapsed time 2 10,000 mutex off 31 sec 2 10,000 mutex on 41 sec   2 10,000 swap/block off 26 sec 2 10,000 swap/block on 35.7 sec   2 10,000 CAS/non-blkg off 25 sec   8 10,000 mutex off 128 sec 8 10,000 mutex on 168 sec   8 10,000 swap/block off 153 sec 8 10,000 swap/block on 170 sec   8 10,000 CAS/non-blkg off 105 sec

The tests conclude that coarser grain locks (level 2) produce 25-30% performance penalty and is more efficient when objects are shared by value among threads than by reference. Furthermore, the atomic swap based updates show performance benefits only when the number of CPUs on the system is greater or equal to the number of concurrent servers. When the number of servers exceed the amount of available CPUs, swap based spin blocking is inefficient and shows marked performance degradation.

A new technique that completely avoids guard class for counter increments and decrements has been implemented. It uses SPARC -v9 CAS instruction for atomic update of the reference counting. This technique by far shows the best performance of all schemes and the results on 8 CPUs show up to 40-50% in performance benefit.

This table will be updated as soon as tests are completed.

Conclusion

A study was performed using a multi-threaded server to assess the performance of standard library using different levels of synchronization. Present study shows the advantage of providing users with the ability to have a pluggable STL and standard library so that they can tune their requirements. Current evaluation of other STL implementations has been successful in integrating with Workshop compilers.

April 2000

Back to Top


Rate and Review Tell us what you think of the content of this page. Excellent   Good   Fair   Poor   Comments:
If you would like a reply to your comment, please submit your email address:
Note: We may not respond to all submitted comments.
Close    To Top
  • Prev Article-OS:
  • Next Article-OS:
  • Now: Tutorial for Web and Software Design > OS > Solaris > OS Content
    Photoshop Tutorial
     

    Special Effect

      3D Effect
      Photoshop Articles
    Programming Tutorial
     

    C/C++ Tutorial

      Visual Basic
      C# Tutorial
    Database Tutorial
     

    MySQL Tutorial

      MS SQL Tutorial
      Oracle Tutorial
    Geek Tutorial
     

    Blogging Tutorial

      RSS Tutorial
      Podcasting Tutorial
    Graphic Design Tutorial
      Coreldraw Tutorial
      Illustrator Tutorial
      3D Tutorials
    Webmaster Articles
     

    Domain Service

      Web Hosting
      Site Promotion
    Java Tutorial/ Articles
     

    Java Servlets

      JavaEE Tutorial
     

    JavaBeans Tutorial

    XML Tutorial/ Articles
     

    XML Style

      AJAX Tutorial
      XML Mobile
    Flash Tutorial/ Articles
     

    Flash Video

      Action Script
      Flash Articles
    OS Tutorial/ Articles
      Linux Tutorial
      Symbian Tutorial
      MacOS Tutorial
    Personal Tech
      Hardware Tutorial
      Software Tutorial
      Online Auction