//5
Finally, object b tries to change
the value of the string. It looks at the reference count and sees that
it is greater than one, which implies that the string is being shared by
more than one object. At this point, a clone of the string is made and
modified. The reference count of the original string drops back down to
three, and the reference count of the newly cloned string is one.
The counter for the reference increments and decrements under the protection
of a lock.
Back to Top
LEVEL 2 Thread_safety
The library is safe when only pass by value is used. So if you pass std::strings
by reference, you will need to lock access and enforce one-thread-at-a-time
access yourself. Take the following example:
std::string str = "A string";
a()
{
...
thr_create(NULL, 0, b, (LPVOID)NULL, 0,&tid);
str = "";
}
b()
{
std::string strcopy = str;
}
If both assignments happen at the same time, the first one can end up in
the reference by going to 0, thus freeing the memory. Meanwhile, the second
assignment can see the reference at 1 and decides to add a reference to
a now deallocated replica. This example demonstrates that even a simple
operation such as "=" should have a lock to prevent a race condition. In
other words, every accessible member function should be guarded and such
can result in severe performance penalty.
To access the performance implication, a level_2 thread safety model
was implemented for string class. The header files and string class member
were modified to accommodate one thread at a time access. Take a look at
a sample implementation:
#if defined (_RWSTD_MULTI_THREAD) && defined
(_RWSTD_STRING_MT_SAFETY_LEVEL_2)
# define MT_GUARD(name) _RWSTDGuard name (this->__mutex)
#else
# define MT_GUARD(name) ((void)0)
#endif // _RWSTD_MULTI_THREAD
When level_2 thread safety is defined, Rogue Wave's STDGuard method gets
invoked, which defaults to a void.
The actual implementation for the operator "=" method looks like the
following:
template <class charT, class traits, class Allocator
> |
basic_string<charT, traits, Allocator> &
basic_string<charT, traits, Allocator>::operator= (
const basic_string<charT, traits, Allocator>& str)
{
if (this != &str)
{
MT_GUARD(guard);
if (str.__pref()->__references() > 0)
{
str.__pref()->__addReference();
__unLink();
__data_ = str.__data_.data();
}
else
this->operator=(str.c_str());
}
return *this;
}
As you can see, the locking is introduced at a much coarser level that
is thus serializing the entire invocation.
Back to Top
Atomic updates for reference counting
Another optimization uses atomic updates for counters instead of mutex.
Level 2 type guarding at coarser grain shows that the critical section
is much larger. It also shows that mutex based guard overheads are small
in comparison to the amount of work done in the critical section. However,
reference counting, and counter increment and decrement operations are
more efficient using atomic updates because the overheads from calling
mutex methods can be much higher.
This section discusses the two different optimizing techniques. One
uses a more general swap based on spin blocking while the other uses compare
and swap based updates. You can use spin blocking as a general replacement
for mutexes. The CAS based Fetch_and_Add is more specific to counter increments
and decrements and requires no guarding.
Swap based spin blocking for guard methods are shown below and these
methods were implemented in regular header file in a transparent manner.
In the following example, SWAPINITLOCK, SWAPLOCK and SWAPUNLOCK macros
are called in the place of pthread_mutex_init, pthread_mutex_lock and pthread_mutex_unlock:
extern "C" Test_and_Set(long *,long);
#define SWAPINITLOCK(lp)
(*(lp) = 0)
#define SWAPLOCK(lp)
while (Test_and_Set(lp, 1L));
#define SWAPUNLOCK(lp)
Test_and_Set(lp,0L)
The Test_and_Set is implemented as a inline function using swap:
.inline Test_and_Set,8
mov %o0,%o2
mov %o1,%o0
swap [%o2],%o0
.end
In CAS based updates, _RWSTD_MT_INCREMENT macro is defined to be a Fetch_and_Add
inline assembler function:
.inline fetch_and_add,4
retry:
ld [%o0],%l0
add %l0,%o1,%l1
cas [%o0],%l0,%l1
cmp %l0,%l1
bne retry
mov %l1,%o0
nop
.end
Back to Top
Performance Measurement
A simple multi-threaded server was designed to test and compare performances.
For each case, a new library (libCstd.a) was built and timing measurements
were done under the same condition on the same system. The ultra 60 repeatedly
created string objects in a mt_mode.
No. of threads
No. of objects
method
level_2
elapsed time
2
10,000
mutex
off
31 sec
2
10,000
mutex
on
41 sec
2
10,000
swap/block
off
26 sec
2
10,000
swap/block
on
35.7 sec
2
10,000
CAS/non-blkg
off
25 sec
8
10,000
mutex
off
128 sec
8
10,000
mutex
on
168 sec
8
10,000
swap/block
off
153 sec
8
10,000
swap/block
on
170 sec
8
10,000
CAS/non-blkg
off
105 sec
The tests conclude that coarser grain locks (level 2) produce 25-30% performance
penalty and is more efficient when objects are shared by value among threads
than by reference. Furthermore, the atomic swap based updates show performance
benefits only when the number of CPUs on the system is greater or equal
to the number of concurrent servers. When the number of servers exceed
the amount of available CPUs, swap based spin blocking is inefficient and
shows marked performance degradation.
A new technique that completely avoids guard class for counter increments
and decrements has been implemented. It uses SPARC -v9 CAS instruction for
atomic update of the reference counting. This technique by far shows the
best performance of all schemes and the results on 8 CPUs show up to 40-50%
in performance benefit.
This table will be updated as soon as tests are completed.
Conclusion
A study was performed using a multi-threaded server to assess the performance
of standard library using different levels of synchronization. Present
study shows the advantage of providing users with the ability to have a
pluggable STL and standard library so that they can tune their requirements.
Current evaluation of other STL implementations has been successful in
integrating with Workshop compilers.
April 2000
Back to Top