TITLE: multithreaded reference counting concerns

(Newsgroups: comp.lang.c++.moderated, 26 Oct 98)

AUTHOR: herbs@cntc.com (Herb Sutter)

.--------------------------------------------------------------------.
|  Guru of the Week problems and solutions are posted regularly on   |
|   news:comp.lang.c++.moderated. For past problems and solutions    |
|            see the GotW archive at http://www.cntc.com.            |
| Is there a topic you'd like to see covered? mailto:herbs@cntc.com  |
`--------------------------------------------------------------------'
_______________________________________________________

GotW #45:   Reference Counting - Part III

Difficulty: 9 / 10
_______________________________________________________


-------------------------------------------------------

Congratulations to Nathan Myers, the Guru of the Week!

   Last month, in comp.std.c++, I asserted that using
   copy-on-write (COW) in a thread-safe way necessarily
   incurred performance penalties of an order of
   magnitude or two on popular platforms, making COW
   decidedly a pessimization.  Thanks to Nathan for
   correctly pointing out that, if atomic integer
   operations are available, this performance overhead
   can be reduced.

   Unfortunately, even with atomic integer operations,
   a thread-safe COW implementation is still
   substantially slower than a non-COW implementation,
   and COW indeed remains in general a "false
   optimization" in code that might be used in a
   multithreaded environment.  The worst part is that
   the performance hit applies even if the code is
   actually used in only a single-threaded program.

-------------------------------------------------------


Introduction
------------

Standard C++ is silent on the subject of threads.
Unlike Java, C++ has no built-in support for threads
and gives no consideration to thread safety issues.  So
why a GotW issue on threads?  Simply because more and
more of us are writing multithreaded (MT) programs, and
no discussion of reference-counted String
implementations is complete if it does not cover thread
safety issues.

I won't go into a long discussion about what's involved
with writing thread-safe code in detail; see a good
book on threads for more details.


>JG Question
>-----------
>
>Why is Optimized::String not thread-safe?  Give examples.

It is the responsibility of code that uses an object to
ensure that access to the object is serialized as
needed.  For example, if a certain String object could
be modified by two different threads, it's not the poor
String object's job to defend itself from abuse; it's
the job of the code that uses the String to make sure
that two threads never try to modify the same String
object at the same time.

The problem with the code in GotW #44 is twofold:
First, the reference-counted (copy-on-write, COW)
implementation is designed to hide the fact that two
different visible String objects could be sharing a
common hidden state; hence it's the String class's
responsibility to ensure that the calling code never
modifies a String whose representation is shared.  The
String code shown already does that, by performing a
deep copy ("un-sharing" the representation) if
necessary the moment a caller tries to modify the
String.  This part is fine in general.

Unfortunately, it now brings us to the second problem:
The code in String that "un-shares" the representation
isn't thread-safe.  Imagine that there are two String
objects s1 and s2:

a) s1 and s2 happen to share the same representation
   under the covers (okay, because String is designd
   for this);

b) thread 1 tries to modify s1 (okay, because thread 1
   knows that no one else is trying to modify s1);

c) thread 2 tries to modify s2 (okay, because thread 2
   knows that no one else is trying to modify s2);

d) at the same time (error)

The problem is (d):  At the same time, both s1 and s2
will attempt to "un-share" their shared representation,
and the code to do that is is not thread-safe.
Specifically, consider the very first line of code in
String::AboutToModify():

    void String::AboutToModify(
      size_t n,
      bool   bMarkUnshareable /* = false */
    ) {
      if( data_->refs > 1 && data_->refs != Unshareable ) {
        /* ... etc. ... */

This if-condition is not thread-safe.  For one thing,
evaluating even "data_->refs > 1" may not be atomic; if
so, it's possible that if thread 1 tries to evaluate
"data_->refs > 1" while thread 2 is updating the value
of refs, the value read from data_->refs might be
anything -- 1, 2, or even something that's neither the
original value nor the new value.  The problem is that
String isn't following the basic thread-safety
requirement that code that uses an object must ensure
that access to the object is serialized as needed.
Here, String has to ensure that no two threads are
going to use the same "refs" value in a dangerous way
at the same time.  The traditional way to accomplish
this is to serialize access to the shared StringBuf (or
just its .refs member) using a critical section or
semaphore.  In this case, at minimum the "comparing an
int" operation must be guaranteed to be atomic.

This brings us to the second issue:  Even if reading
and updating "refs" were atomic, there are two parts to
the if condition.  The problem is that the thread
executing the above code could be interrupted after it
evaluates the first part of the condition but before it
evaluates the second part.  In our example:

    Thread 1                          Thread 2

    >> enter s1's AboutToModify()

    evaluate "data_->refs > 1"
    (true, because data_->refs is 2)

                         ******** context switch ********

                                      >> enter s2's AboutToModify()

                                      (runs all the way to completion,
                                      including that it decrements
                                      data_->refs to the value 1)

                                      << exit s2's AboutToModify()

                         ******** context switch ********

    evaluate "data_->refs != Unshareable"
    (true, because data_->refs is now 1)

    enters AboutToModify's "I'm shared and
    need to unshare" block, which clones
    the representation, decrements
    data_->refs to the value 0, and gives
    up the last pointer to the StringBuf...
    poof, we have a memory leak because
    the StringBuf that had been shared by
    s1 and s2 can now never be deleted

Having covered that, we're ready to see how to solve
these safety problems.


>Guru Questions
>--------------
>
>1. Demonstrate how to make Optimized::String thread-safe:
>
>    a) assuming that there atomic operations to get,
>        set, and get-and-set integer values; and

>    b) assuming that there aren't.

I'm going to answer b) first because it's more general.
What's needed here is a lock-management device like a
critical section or a mutex. The code is equivalent in
either case, so below I'll use a critical section,
which is usually a more efficient synchronization
device than a general-purpose mutex. The key to what
we're about to do is quite simple: It turns out that if
we do things right we only need to lock access to the
reference count itself.

Before doing anything else, we need to add a
CriticalSection member object into
Optimized::StringBuf.  Let's call the member cs:

  namespace Optimized {

    struct StringBuf {
        StringBuf();              // start off empty
       ~StringBuf();              // delete the buffer
        StringBuf( const StringBuf& other, size_t n = 0 );
                                  // initialize to copy of other,
                                  //  and ensure len >= n

        void Reserve( size_t n ); // ensure len >= n

        char*    buf;             // allocated buffer
        size_t   len;             // length of buffer
        size_t   used;            // # chars actually used
        unsigned refs;            // reference count
        CriticalSection cs;       // serialize work on this object
    };

The only function that necessarily operates on two
StringBuf objects at once is the copy constructor.
String only calls StringBuf's copy constructor from two
places (from String's own copy constructor, and from
AboutToModify()).  Note that String only needs to
serialize access to the reference count, because by
definition no String will do any work on a StringBuf
that's shared (if it is shared, it will be read in
order to take a copy, but we don't need to worry about
anyone else trying to change or Reserve() or otherwise
alter/move the buffer).

The default constructor needs no locks:

    String::String() : data_(new StringBuf) { }

The destructor need only lock its interrogation and
update of the refs count:

    String::~String() {
      bool bDelete = false;
      data_->cs.Lock(); //---------------------------
      if( data_->refs == Unshareable || --data_->refs < 1 ) {
        bDelete = true;
      }
      data_->cs.Unlock(); //-------------------------
      if( bDelete ) {
        delete data_;
      }
    }

For the String copy constructor, note that we can
assume that the other String's data buffer won't be
modified or moved during this operation, since it's the
responsibility of the caller to serialize access to
visible objects.  We must still, however, serialize
access to the reference count itself, as we did above:

    String::String( const String& other )
    {
      bool bSharedIt = false;
      other.data_->cs.Lock(); //---------------------
      if( other.data_->refs != Unshareable ) {
        bSharedIt = true;
        data_ = other.data_;
        ++data_->refs;
      }
      other.data_->cs.Unlock(); //-------------------

      if( !bSharedIt ) {
        data_ = new StringBuf( *other.data_ );
      }
    }

So making the String copy constructor safe wasn't very
hard at all.  This brings us to AboutToModify(), which
turns out to be very similar, but notice that this
sample code actually acquires the lock during the
entire deep copy operation... really, the lock is
strictly only needed when looking at the refs value,
and again when updating the refs value at the end, but
I decided to lock the whole operation instead of
getting slightly better concurrency by releasing the
lock during the deep copy and then reacquiring it just
to update refs:

    void String::AboutToModify(
      size_t n,
      bool   bMarkUnshareable /* = false */
    ) {
      bool bCopiedIt = false;
      data_->cs.Lock(); //---------------------------
      if( data_->refs > 1 && data_->refs != Unshareable ) {
        bCopiedIt = true;
        StringBuf* newdata = new StringBuf( *data_, n );
        --data_->refs;   // now all the real work is
        data_ = newdata; //  done, so take ownership
      }
      data_->cs.Unlock(); //-------------------------

      if( !bCopiedIt ) {
        data_->Reserve( n );
      }
      data_->refs = bMarkUnshareable ? Unshareable : 1;
    }

None of the other functions need to be changed.
Append() and operator[]() don't need locks because once
AboutToModify() completes we're guaranteed that we're
not using a shared representation.  Length() doesn't
need a lock because by definition we're okay if our
StringBuf is not shared (there's no one else to change
the used count on us), and we're okay if it is shared
(the other thread would take its own copy before doing
any work and hence still wouldn't modify our used count
on us):

    void String::Append( char c ) {
      AboutToModify( data_->used+1 );
      data_->buf[data_->used++] = c;
    }

    size_t String::Length() const {
      return data_->used;
    }

    char& String::operator[]( size_t n ) {
      AboutToModify( data_->len, true );
      return *(data_->buf+n);
    }

  }

Again, note the interesting thing in all of this:  The
only locking we needed to do involved the refs count
itself.

With that observation and the above general-purpose
solution under our belts, let's look back to the a)
part of the question:


>    a) assuming that there atomic operations to get,
>        set, and get-and-set integer values;

Some operating systems provide these kinds of
functions.

NOTE:  These functions are usually much more efficient
   than general-purpose synchronization primitives like
   critical sections and mutexes.  It is, however, a
   fallacy so say that we can use atomic integer
   operations "instead of locks" because locking is
   still required -- the locking is just generally less
   expensive than other alternatives, but it's not free
   by a long shot.

Here is a thread-safe implementation of String that
assumes we have three functions:  an IntAtomicGet, and
IntAtomicDecrement and IntAtomicIncrement that safely
return the new value.  We'll do essentially the same
thing we did above, but use only atomic integer
operations to serialize access to the refs count:

  namespace Optimized {

    String::String() : data_(new StringBuf) { }

    String::~String() {
      if( IntAtomicGet( data_->refs ) == Unshareable ||
          IntAtomicDecrement( data_->refs ) < 1 ) {
        delete data_;
      }
    }

    String::String( const String& other )
    {
      if( IntAtomicGet( other.data_->refs ) != Unshareable ) {
        data_ = other.data_;
        IntAtomicIncrement( data_->refs );
      }
      else {
        data_ = new StringBuf( *other.data_ );
      }
    }

    void String::AboutToModify(
      size_t n,
      bool   bMarkUnshareable /* = false */
    ) {
      int refs = IntAtomicGet( data_->refs );
      if( refs > 1 && refs != Unshareable ) {
        StringBuf* newdata = new StringBuf( *data_, n );
        if( IntAtomicDecrement( data_->refs ) < 1 ) {
          delete newdata;  // just in case two threads
        }                  //  are trying this at once
        else {             // now all the real work is
          data_ = newdata; //  done, so take ownership
        }
      }
      else {
        data_->Reserve( n );
      }
      data_->refs = bMarkUnshareable ? Unshareable : 1;
    }

    void String::Append( char c ) {
      AboutToModify( data_->used+1 );
      data_->buf[data_->used++] = c;
    }

    size_t String::Length() const {
      return data_->used;
    }

    char& String::operator[]( size_t n ) {
      AboutToModify( data_->len, true );
      return *(data_->buf+n);
    }

  }


>2. What are the effects on performance?  Discuss.

In a word:  Bad.  Without atomic integer operations,
copy-on-write typically incurs a performance slowdown
of an order of magnitude or two.  Even with atomic
integer operations, COW can make common String
operations take nearly 50% longer -- even in
single-threaded programs.

In general, copy-on-write is a bad idea in
multithread-ready code.  In short, the reason is that
the calling code can no longer know whether two
distinct String objects actually share the same
representation under the covers, and so String must
incur overhead to do enough serialization that calling
code can take its normal share of responsibility for
thread safety.  Depending on the availability of
more-efficient options like atomic integer operations,
the impact on performance ranges from "moderately
heavy" to "profound."


Some Empirical Results
----------------------

Enough theory; here are some actual test cases.  I put
four versions of String/StringBuf through a test
harness that primarily calls String::operator[] in a
timed loop.  The four versions I tested are:

    Name           Code Used
    -------------  ------------------------------------
    Unsafe COW     GotW #44 answer
    CritSec COW    GotW #45 1(b) answer above
    Mutex COW      GotW #45 1(b) answer, with Critical-
                     Section replaced by Mutex
    AtomicInt COW  GotW #45 1(a) answer, above

All results were compared to Unsafe COW.  In short,
AtomicInt COW averaged 44% slower in this particular
test code, CritSec COW averaged nearly 8 times slower,
and Mutex COW averaged 46 times slower.  (Under
Windows, a critical section is more efficient than a
mutex, because the latter offers more services and can
also be used to synchronize across processes.)  I will
make the test code available on the GotW website.

    # Output of two runs of 1,000,000 iterations each.
    # The test harness was compiled with MSVC 5.0 SP3.

      Unsafe COW:       6794 ms      100%
      CritSec COW:     53436 ms      786%
      Mutex COW:      317174 ms     4668%
      AtomicInt COW:    9806 ms      144%

      Unsafe COW:       6798 ms      100%
      CritSec COW:     53464 ms      786%
      Mutex COW:      316900 ms     4661%
      AtomicInt COW:    9805 ms      144%


A few important points about this test harness:

0. CAVEAT LECTOR:  Take this for what it is:  A first
   cut at a test harness.  Comments and corrections are
   welcome.  I'm showing raw performance numbers here;
   I haven't inspected the compiled code, and I've made
   no attempt to determine the impact of cache
   hits/misses and other secondary effects.  (Even so,
   this GotW took much more effort than usual to
   produce, and I guarantee that the next few issues
   will feature simpler topics!)

1. The test harness is designed to exercise possibly-
   mutating operations, but it is otherwise designed to
   be as favourable as possible to the locking
   operations:  it operates only on unshared strings,
   it does not actually modify the strings, and there
   is no lock contention because there's only one
   thread (so it never has one thread waiting while
   another is performing an atomic operation).  In
   other words, this code isn't trying to unfairly
   stack the deck against the locking operations in an
   attempt to magnify the performance difference.

2. Inside the timed loop, the code does not modify the
   string; that is, operator[] is called but the
   returned reference is not used to change the string.
   This demonstrates that the overhead is indeed
   incurred "for every possibly mutating operation
   whether the string is shared or not," and not just
   "for every mutating operation on a shared string" as
   is sometimes suggested.  Note that operator[] is
   just like the standard begin() and end() in this
   respect, since all three functions return an
   iterator or reference that can later be used to
   modify the string.

3. TANSTAAFL ("there ain't no such thing as a free
   lunch" -R.A.Heinlein). Even with atomic integer
   operations, it is incorrect to say "there's no
   locking required" because the atomic integer
   operations clearly do perform serialization, and do
   incur significant overhead.

4. The test harness itself is SINGLE-threaded.  A
   thread-safe COW implementation incurs overhead even
   in programs that are not themselves multithreaded.
   At best, COW could be a true optimization only when
   the COW code does not need to be made thread-safe at
   all (even then, see Rob Murray's "C++ Strategies and
   Tactics" book for empirical tests that show COW is
   only beneficial in certain situations).  If thread
   safety is required, COW imposes a significant
   performance penalty on all users, even users running
   only single-threaded code.

For more details, see the upcoming article on this
topic in... (later I'll announce which magazine).