Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator

author-image

By

Intel® Transactional Synchronization Extensions (Intel® TSX) is perhaps one of the most non-trivial extensions of instruction set architecture introduced in the 4th generation Intel® Core™ microarchitecture code name Haswell. Intel® TSX implements hardware support for a best-effort “transactional memory”, which is a simpler mechanism for scalable thread synchronization as opposed to inherently complex fine-grained locking or lock-free algorithms. The extensions have two interfaces: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). 

In this blog I will show how you can write your first RTM code and execute it in an emulated environment now, without waiting until the 4th generation Intel® Core™ processors become available for purchase.

Before diving in, please make sure you have a basic understanding of the new RTM instructions. I refer you to this blog as an introduction. Check out also the Intel Developer Forum’12 presentation by Ravi Rajwar & Martin Dixon discussing the details of Intel TSX implementation in Haswell hardware and a presentation by Andi Kleen on adding lock elision (also using RTM) to Linux.

My plan was to write a toy bank account processing application using popular C++ thread-unaware data structures from STL with concurrent access to bank records managed by Intel TSX. This way the implementation should be very simple, thread-safe and scalable.

Development Environment

For this experiment one needs the newest version (later than 5.3.1) of Intel® Software Development Emulator (Intel® SDE) and a compiler that can generate RTM instructions (via intrinsics or direct machine code). Please note that performance measurements with Intel SDE running RTM are of limited value because the overhead of emulating TM in software instead of using real hardware is huge, but as you will see later Intel SDE can already demonstrate important points for RTM usage for concurrency library developers and application programmers.

Since my laptop runs Windows I decided to try Intel SDE/RTM on Windows. I have chosen the C++ compiler from “Microsoft Visual Studio 2012 for Windows Desktop” (there is a free “Express” version that works for my purpose too). With a few clicks I quickly setup a console application project and included immintrin.h header the main .cpp file to use RTM intrinsics.

The Test

As a bank account structure the simple std::vector<int> from C++ standard template library has been chosen. “Accounts[i]” stores current account balance for account number i. This is very simple and popular but thread-unsafe data structure which must be protected by concurrency control mechanisms for parallel access. Usually locks/mutexes are used to limit the number of threads accessing the structure simultaneously. However, for parallel write accesses the whole data structure usually is locked exclusively even if distinct parts of it have to be updated. Intel TSX should help here since it can optimistically execute writes, and if there is no real data conflict happening, the writes are committed without serializing.

To simplify the operations on the accounts I wanted to implement an easy-to-use C++ wrapper for protecting the current C++ scope from unsafe concurrent access to the data:

{
        std::cout << "open new account" << std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.push_back(0);
}

{
        std::cout << "open new account" << std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.push_back(0);
}

{

        std::cout << "put 100 units into account 0" <<std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts[0] += 100; // atomic update due to RTM
}

{

        std::cout << "transfer 10 units from account 0 to account 1 atomically!" << std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts[0] -= 10;
        Accounts[1] += 10;

}

{

        std::cout << "atomically draw 10 units from account 0 if there is enough money"<< std::endl;
        TransactionScope guard; // protect everything in this scope
        if(Accounts[0] >= 10) Accounts[0] -= 10;

}

{
        std::cout << "add 1000 empty accounts atomically"<< std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.resize(Accounts.size() + 1000, 0);
}

Legacy applications implement such guards using a lock that allows only a single writer to execute the critical section (read-write locks are more complicated to handle and also do not make much sense here in our case because all accesses are writes/updates):

class TransactionScope
{
        SimpleSpinLock & lock;
        TransactionScope(); // forbidden
public:

        TransactionScope(SimpleSpinLock & lock_): lock(lock_) { lock.lock(); }
        ~TransactionScope() { lock.unlock(); }
};

 

Implementing and Testing with RTM

A naive RTM implementation for TransactionScope (handling both read/lookup and write/update accesses transparently) would be (changed lines are marked with █):

class TransactionScope
{
public:
        TransactionScope()

{

█               int nretries = 0;

█               while(1)

█               {

█                       ++nretries;

█                       unsigned status = _xbegin();

█                       if(status == _XBEGIN_STARTED) return; // successful start

█                       // abort handler

█                       std::cout << "DEBUG: Transaction aborted "<< nretries <<

█                          " time(s) with the status "<< status << std::endl;

█               }

        }

█       ~TransactionScope() { _xend(); }

};

I have successfully compiled this code and tried to run it through Intel SDE:

./sde-bdw-external-5.31.0-2012-11-01-win/sde.exe -hsw -rtm-mode full -- ./ConsoleApplication1.exe
open new account
DEBUG: Transaction aborted 1 time(s) with the status 0
DEBUG: Transaction aborted 2 time(s) with the status 0
DEBUG: Transaction aborted 3 time(s) with the status 0
DEBUG: Transaction aborted 4 time(s) with the status 0
DEBUG: Transaction aborted 5 time(s) with the status 0
DEBUG: Transaction aborted 6 time(s) with the status 0
DEBUG: Transaction aborted 7 time(s) with the status 0
DEBUG: Transaction aborted 8 time(s) with the status 0
DEBUG: Transaction aborted 9 time(s) with the status 0
DEBUG: Transaction aborted 10 time(s) with the status 0
DEBUG: Transaction aborted 11 time(s) with the status 0
DEBUG: Transaction aborted 12 time(s) with the status 0
DEBUG: Transaction aborted 13 time(s) with the status 0
DEBUG: Transaction aborted 14 time(s) with the status 0
DEBUG: Transaction aborted 15 time(s) with the status 0
DEBUG: Transaction aborted 16 time(s) with the status 0

and so on…

The program went into infinite loop always aborting on the first transaction. The RTM debug log from Intel SDE (emx-rtm.txt) also confirmed that (used option “-rtm_debug_log 2”). Well, a general rule is that failure is more or less expected for any implementation that ignores specification… Intel® Architecture Instruction Set Extensions Programming Reference explicitly mentions that “the hardware provides no guarantees as to whether an RTM region will ever successfully commit transactionally”. Because of that the software using RTM must provide (non-transactional) fall-back path that is executed if (many) aborts are happening (By the way: HLE provides the fall-back automatically, since on the first abort, the same critical section is executed non-transactionally).

Implementing Fall-Back

Here is our second attempt that acquires a fall-back spin lock non-transactionally after specified number of retries.

// handle _xabort(0xff) from above
 if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff

      && !(status & _XABORT_NESTED))
 {
        while(fallBackLock.isLocked()) _mm_pause(); // wait until lock is free

█} else if(!(status & _XABORT_RETRY)) break; /* take the fall-back lock

    if the retry abort flag is not set */

The output:

open new account
DEBUG: Transaction aborted 1 time(s) with the status 0
open new account
put 100 units into account 0
transfer 10 units from account 0 to account 1 atomically!
atomically draw 10 units from account 0 if there is enough money
add 1000 empty accounts atomically

Now we see that the program makes faster progress by taking the fall-back lock sooner in the case of a “hard” abort.

As you may notice, the changes so far were isolated within some synchronization interface, TransactionScope. The application code was not changed. As generally available TSX software infrastructure evolves in future you should look for a proven existing library that has (scope) locks with RTM support to avoid pitfalls in your synchronization primitives (we will talk about pitfalls in applicationcode in future blogs). For example a TSX-enabled pthread library for Linux is already available. On the other hand, it is not uncommon for existing applications to use an extended or custom synchronization interfaces, converting them to take advantage of TSX is not a complicated task either if done with care.

Concurrent Accesses from Several Threads Managed by Intel TSX

After basic debugging the time has come to see the real power of Intel TSX: run two worker threads doing random concurrent updates to the central account data structure:

unsigned __stdcall thread_worker(void * arg)

{

        int thread_nr = (int) arg;

        std::cout << "Thread "<< thread_nr<< " started." << std::endl;

        // create thread-local TR1 C++ random generator from <random>

        std::tr1::minstd_rand myRand(thread_nr); 

        long int loops = 10000;
  
        while(--loops)

        {

                {

                        TransactionScope guard(globalFallBackLock);

                        // put 100 units into a random account atomically

                        Accounts[myRand() % Accounts.size()] += 100;

                }

                {

                        TransactionScope guard(globalFallBackLock);

                        /* transfer 100 units between random accounts 

                           (if there is enough money) atomically */

                        int a = myRand() % Accounts.size()

                        int b = myRand() % Accounts.size();

                        if(Accounts[a] >= 100)

                        {

                                Accounts[a] -= 100;

                                Accounts[b] += 100;

                        }

                }

        }

        std::cout << "Thread "<< thread_nr<< " finished." << std::endl;

        return 0;
}

I built Release build without DEBUG output and see that there are only about 100-300 aborts for the total of 20000 transactions. Debug output says that the abort flag status is 6: retry and “memory access conflict” bits are set. This is exactly what I expected from Intel TSX: almost all updates are done in parallel and only a few have been rolled back due to a conflict.

To double check if my conclusions are right and emulator works as I expected I added an increment/update of a global counter in the transactions to introduce a huge number of conflicting accesses. And yes, it worked: with that change I have seen about 5-15K aborts. Although the absolute numbers obtained from the RTM emulator are not able to exactly predict the execution metrics on future hardware, the orders of magnitude should still indicate possible issues with RTM usage.

Last Words

These were my experiences with RTM and the new Intel® Software Development Emulator. Get prepared for Haswell and check out how your software can use Restricted Transactional Memory with Intel SDE now!