Solving the Python Threading Dilemma
Python* is a powerful language, especially for AI and machine learning development. But CPython (the original reference implementation and bytecode interpreter for the language) cannot run Python code on multiple threads in parallel; it requires under-the-hood support to enable multithreaded capabilities and parallel processing. The infamous Global Interpreter Lock (GIL) “locks” the CPython interpreter into executing bytecode on only one thread at a time, regardless of context, in both single-threaded and multithreaded environments. That is why libraries like NumPy, SciPy, and PyTorch* rely on C-based implementations to provide some of the multi-core processing that developers want.
Takeaway: Python excels at executing single-threaded programs, but it suffers in situations where multithreading or multiprocessing is required or preferred.
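To see the GIL in action, consider a quick experiment (a minimal sketch; the exact timings are machine-dependent). A CPU-bound countdown takes roughly as long on two threads as on one, because the threads take turns holding the GIL instead of running in parallel:

import time
from threading import Thread

def countdown(n):
    # Pure-Python, CPU-bound work; the GIL is never released
    while n:
        n -= 1

N = 50_000_000

start = time.perf_counter()
countdown(N)
print(f"one thread:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
threads = [Thread(target=countdown, args=(N // 2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f}s  # roughly the same, not ~2x faster")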
Let’s look at Python through a different lens.
Imagine vanilla Python as a single needle and the GIL as a single line of thread. With that needle and thread, a shirt is made. Its quality is amazing, but perhaps it could have been made more efficiently while maintaining that same quality. In that vein, what if we could work around that “limiter” by making Python applications parallel, such as by using libraries in the oneAPI programming model or Numba*? What if it were no longer just a needle and single thread being used to manufacture that shirt, but rather a sewing machine? And what if it were not just one but dozens or hundreds of sewing machines working together to make many of those shirts in record time?
This is the goal of the powerful libraries and tools behind the Intel® Distribution for Python, a set of high-performance packages optimized for the instruction sets of Intel® architectures.
The Intel distribution helps developers achieve performance close to that of a C++ program for compute-intensive, core Python numerical and scientific packages, including NumPy, SciPy, and Numba. It does this by accelerating math and threading operations with oneAPI libraries while keeping Python overhead low. The result is highly efficient multithreading, vectorization, and memory management for applications, along with effective scaling across a cluster.1
Let’s take a deeper dive into Intel’s approach to improved composability and parallelism in Python and how it can help accelerate your AI/ML workflows.
Nested Parallelism: NumPy and SciPy
NumPy and SciPy are Python libraries specifically designed for numerical processing and scientific computing, respectively.
One workaround to enable multithreading/parallelism in Python programs is to expose parallelism on all possible levels of a program, for example by parallelizing the outermost loops or by using other functional or pipeline types of parallelism at the application level. Libraries such as Dask, Joblib, and the built-in multiprocessing module (including its ThreadPool class) can help achieve this parallelism, as sketched below.
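As a minimal sketch of this outermost-level approach (the chunking and worker function here are hypothetical placeholders), the standard ThreadPool can parallelize an outer loop over independent pieces of work:

from multiprocessing.pool import ThreadPool

def process_chunk(chunk):
    # Placeholder for per-chunk work, e.g., a NumPy or SciPy computation
    return sum(chunk)

chunks = [range(i * 1_000_000, (i + 1) * 1_000_000) for i in range(4)]
with ThreadPool(4) as pool:  # parallelism exposed at the outermost loop
    results = pool.map(process_chunk, chunks)
print(sum(results))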
Given the heavy demands of processing big data in AI and machine learning applications, data parallelism can be achieved with Python modules like NumPy and SciPy, which can, in turn, be accelerated by an optimized math library such as the Intel® oneAPI Math Kernel Library (oneMKL). oneMKL is multithreaded using multiple threading runtimes, and the threading layer can be controlled via the MKL_THREADING_LAYER environment variable, as shown below.2
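For example, here is a small sketch (assuming a oneMKL-backed NumPy, as shipped in the Intel Distribution for Python) that selects the threading layer before MKL is first loaded:

import os
os.environ["MKL_THREADING_LAYER"] = "TBB"  # other values include "INTEL" (OpenMP) and "SEQUENTIAL"

import numpy as np  # import NumPy only after setting the variable
a = np.random.random((2048, 2048))
b = a @ a  # the matrix product is threaded by the selected runtime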
This results in a code structure where one parallel region calls a function that contains yet another parallel region: nested parallelism. This parallelism-within-parallelism is an efficient way to minimize or hide synchronization latencies and serial regions (i.e., regions that cannot run in parallel), which are generally unavoidable in NumPy- and SciPy-based programs.
A Step Further: Numba
While NumPy and SciPy provide rich mathematical and data-focused acceleration through C extensions, they remain a fixed set of mathematical instruments. A developer who needs non-standard math will still expect it to run as fast as a C extension. That is where Numba can be used to great effect.
Numba acts as a “Just-In-Time” (JIT) compiler based on LLVM. It works to close the performance gap between Python and statically typed, compiled languages like C and C++. Numba supports multiple threading runtimes, including Intel® oneAPI Threading Building Blocks (oneTBB), OpenMP*, and workqueue, with three built-in threading layers corresponding to these three runtimes. Workqueue is the only threading layer present by default, but the others can be easily installed via conda commands (e.g., $ conda install tbb). The threading layer can be set via the NUMBA_THREADING_LAYER environment variable, and it can be chosen in one of two ways: (1) select a layer that is generally safe under various forms of parallel execution, or (2) explicitly name the desired threading layer (e.g., tbb). For more information on Numba threading layers, refer to the official Numba documentation.
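Here is a minimal sketch (the function and array sizes are illustrative) that explicitly requests the tbb layer and parallelizes a loop with prange:

import numpy as np
from numba import config, njit, prange, threading_layer

config.THREADING_LAYER = "tbb"  # assumes tbb is installed; Numba raises an error otherwise

@njit(parallel=True)
def scaled_sum(x):
    total = 0.0
    for i in prange(x.shape[0]):  # iterations are distributed across threads
        total += x[i] * 0.5
    return total

print(scaled_sum(np.ones(10_000_000)))
print("threading layer used:", threading_layer())  # reports the layer after the first parallel run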
Threading Composability
Threading composability of an application or component of an application is what dictates the effectiveness or efficiency of co-existing multi-threaded components. A "perfectly composable" component would function without detriment to its own efficiency or the efficiency of other components in the system.
Working towards such a perfectly composable threading system requires a conscious effort to avoid spawning an excessive number of threads (over-subscription), which in turn means ensuring that no component or parallel region of code can require a specific number of threads to execute (this is called "mandatory" parallelism).
The alternative is a form of “optional” parallelism, whereby a work scheduler automates the coordination of tasks among components and parallel regions and dictates, at the user level, which threads the components are mapped to. Of course, since the scheduler shares a single thread pool among the program’s components and libraries, its threading model must outperform the built-in schemes of the high-performance libraries it coordinates; otherwise, the efficiency is lost.
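To make the over-subscription problem concrete, here is an illustrative back-of-the-envelope sketch (the numbers are hypothetical):

import os

cores = os.cpu_count() or 16  # e.g., 16 hardware threads
outer_workers = cores         # "mandatory" parallelism: the application pool demands one thread per worker
inner_threads = cores         # ...and each worker calls a math routine that also demands a full team

print(f"threads demanded: {outer_workers * inner_threads} on {cores} cores")
# 16 x 16 = 256 threads contending for 16 cores means heavy over-subscription.
# With "optional" parallelism, components instead submit tasks to one shared
# scheduler that maps them onto a single pool of roughly `cores` threads.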
Intel’s Approach to Composability & Parallelism
By using oneTBB as the work scheduler, threading composability is more easily achieved. oneTBB is an open-source, cross-platform C++ library that enables multi-core parallel processing and was designed with an eye for threading composability and optional and nested parallelism.
As part of the oneTBB version released at the time of writing, an experimental module was made available that unlocks the potential for multi-threaded performance gains in Python by enabling threading composability across multiple libraries. The acceleration comes from the enhanced threading allocation of the scheduler as discussed earlier.
oneTBB provides a Pool class that replaces the standard ThreadPool for Python. By using monkey patching to dynamically replace an object at runtime, the thread pool is activated across modules without requiring any code changes. Moreover, oneTBB replaces oneMKL’s threading layer with its own, providing automatic composable parallelism for calls coming from the NumPy and SciPy libraries.
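Conceptually, the mechanism looks like the hedged sketch below; in practice no code changes are needed, because running a script as python -m tbb app.py (or python -m smp app.py) performs the patching automatically. The TBBBackedPool name here is hypothetical and merely stands in for the module’s real replacement class:

import multiprocessing.pool

class TBBBackedPool(multiprocessing.pool.ThreadPool):
    # Hypothetical stand-in: imagine work items forwarded to the oneTBB scheduler
    pass

# Monkey patching: rebind the name at runtime so existing code picks up the new pool
multiprocessing.pool.ThreadPool = TBBBackedPool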
To test the degree to which nested parallelism can improve performance, see the code samples from the following composability demo, run on a system with MKL-enabled NumPy, TBB, and symmetric multiprocessing (SMP) modules and their corresponding IPython kernels installed. IPython is a rich command-shell interface for interactive computing in multiple programming languages. The demo was run with the Jupyter* Notebook extension to produce a quantitative performance comparison.
import numpy as np
from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)
When changing the kernel in the Jupyter menu, the cell above needs to be re-run each time to re-create the ThreadPool and produce the runtime results described below.3
The default Python kernel is used with the following code, which will be the same line run for each of the three trials:
%timeit pool.map(np.linalg.qr, [np.random.random((256, 256)) for i in range(10)])
This line benchmarks QR decomposition, a building block of algorithms that search for the eigenvalues of a matrix, first on the default Python kernel. Activating the SMP kernel (python -m smp) results in a significant improvement in runtime, up to an order of magnitude. An even greater improvement can be gained by using the oneTBB kernel (python -m tbb).
For this composability demo, oneTBB provides the best performance because of its dynamic task scheduler, which most efficiently handles code where the innermost parallel regions cannot take full advantage of the system’s CPU and where the amount of work may vary. The SMP approach still works well, but it is normally the top performer in cases where the workloads are more balanced and all the outermost workers carry a relatively similar load.
Conclusion: Harnessing Multithreading to Accelerate AI/ML Workflows
There are many ways to improve the efficiency of AI and machine learning oriented Python programs. Harnessing the power of multithreading and multiprocessing will continue to be one of the most critical avenues for accelerating AI/ML software development workflows to, and beyond, their current limits. We also encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.
Get the Software
- Download the Intel Distribution for Python standalone or as part of the Intel® AI Analytics Toolkit.
- Download oneTBB and/or oneMKL standalone or as part of the Intel® oneAPI Base Toolkit.
Explore Python Code Samples
Acknowledgment:
We would like to thank Sergey Maidanov, Oleksandr Pavlyk, and Diptorup Deb for their contributions to this blog.