Improving Python Threading Strategies For AI/ML Workloads

Solving the Python Threading Dilemma

Python is a powerful language, particularly for developing artificial intelligence and machine learning applications. However, CPython, the language’s original reference implementation and bytecode interpreter, does not execute threads in parallel; it needs additional support to enable parallel processing and true multithreaded functionality. The infamous Global Interpreter Lock (GIL) “locks” the CPython interpreter onto a single thread at a time, regardless of context, whether the environment is single-threaded or multithreaded. Libraries such as NumPy, SciPy, and PyTorch therefore rely on C-based implementations to enable some of the desired multi-core processing.

Let’s take a different approach to Python.

Think of the GIL as a single thread and vanilla Python as a single needle. A garment is created using that needle and thread. Although it is of incredible quality, it could have been produced more quickly without sacrificing that quality. What if Intel could overcome this limiter by parallelizing Python programs, for example by utilizing Numba or libraries from the oneAPI programming model? What if a sewing machine were used to make that garment instead of a single needle and thread? And what if dozens or even hundreds of sewing machines, operating together, produced many of those garments in record time?

The Intel Distribution of Python, a collection of high-performance packages that take advantage of the underlying instruction sets of Intel architectures, aims to do exactly this through its robust libraries and tools.

The Intel distribution helps developers attain C++-like performance for compute-intensive numerical and scientific Python packages such as NumPy, SciPy, and Numba by reducing Python overhead and accelerating math operations with oneAPI libraries. In addition to facilitating rapid cluster scaling, this gives developers highly effective multithreading, vectorization, and memory management for their applications.

Let’s examine Intel’s strategy for enhancing Python’s parallelism and composability in more detail and see how it can speed up your AI/ML processes.

Nested Parallelism: NumPy and SciPy

The Python libraries NumPy and SciPy were created especially for numerical computing and scientific computing, respectively.

One workaround for enabling multithreading and parallelism in Python programs is to expose parallelism at all levels of the program, for example by parallelizing the outermost loops or by utilizing other functional or pipeline kinds of parallelism at the application level. This parallelism can be achieved with libraries like Dask, Joblib, and the built-in multiprocessing module (with its ThreadPool class), as in the sketch below.
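The following is a minimal sketch of this outer-level parallelism using Joblib; the function name and matrix sizes are illustrative, and Joblib is assumed to be installed (e.g., pip install joblib):

import numpy as np
from joblib import Parallel, delayed

def norm_of_product(seed):
    # Each task is an independent chunk of numerical work.
    rng = np.random.default_rng(seed)
    a = rng.random((256, 256))
    return np.linalg.norm(a @ a.T)

# prefer="threads" keeps the work in a single process; NumPy's C code
# releases the GIL, so the four workers can genuinely overlap.
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(norm_of_product)(s) for s in range(8)
)
print(results)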

Data parallelism can be achieved with Python modules like NumPy and SciPy, which can in turn be accelerated with an optimized math library such as the Intel oneAPI Math Kernel Library (oneMKL). This is necessary because of the heavy processing demands that large datasets place on AI and machine learning applications. oneMKL is multithreaded using several threading runtimes, and its threading layer can be adjusted with the environment variable MKL_THREADING_LAYER.
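For example, here is a hedged sketch of selecting the layer from within a script, assuming an MKL-enabled NumPy build (the variable must be set before NumPy loads oneMKL):

import os
os.environ["MKL_THREADING_LAYER"] = "TBB"  # e.g., "TBB", "GNU", "SEQUENTIAL"

import numpy as np  # MKL-enabled NumPy picks up the layer at load time

a = np.random.random((2048, 2048))
b = np.random.random((2048, 2048))
c = a @ b  # the matrix multiplication is dispatched to oneMKL
print(c.shape)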

This creates a code structure known as nested parallelism, in which one parallel section calls a function that contains yet another parallel region. In NumPy- and SciPy-based applications, synchronization latencies and serial parts (portions that cannot execute in parallel) are typically inevitable. This parallelism-within-parallelism is an effective technique for reducing or concealing those costs.

Numba

Although NumPy and SciPy offer comprehensive mathematical and data-focused acceleration, they are a fixed set of mathematical tools accelerated with C extensions. If a developer needs custom math that runs as fast as those C extensions, Numba can be effectively employed.

Numba is an LLVM-based “just-in-time” (JIT) compiler that reduces the performance gap between Python and statically typed languages like C and C++. It supports three threading runtimes, Workqueue, OpenMP, and Intel oneAPI Threading Building Blocks (oneTBB), which are represented by its three built-in threading layers. The only threading layer present by default is workqueue; the others can be added with simple conda commands (e.g., $ conda install tbb).

Set the threading layer using the NUMBA_THREADING_LAYER environment variable. It is crucial to realize that there are two methods for selecting this threading layer: (1) picking a layer that is generally safe under various forms of parallel execution (e.g., threadsafe), or (2) explicitly supplying the name of the desired threading layer (e.g., tbb). Consult the official Numba documentation for further details on Numba threading layers; a short sketch follows.
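Here is a minimal sketch of a Numba-parallelized function with an explicit threading-layer choice; the function and array shapes are illustrative, and the tbb package is assumed to be installed if you select the tbb layer:

import numpy as np
from numba import config, njit, prange, threading_layer

config.THREADING_LAYER = "threadsafe"  # or an explicit name like "tbb"

@njit(parallel=True)
def row_norms(a):
    # prange spreads the outer loop's iterations across worker threads.
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):
        out[i] = np.sqrt((a[i, :] ** 2).sum())
    return out

x = np.random.random((4096, 256))
print(row_norms(x)[:3])
print("threading layer chosen:", threading_layer())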

Threading Composability

The threading composability of an application, or of a component of an application, determines how effectively and efficiently its co-existing multithreaded components operate. A component that is “perfectly composable” would operate without compromising either its own efficiency or that of other components in the system.

Aiming for such a completely composable threading system requires a deliberate effort to prevent over-subscription, which means ensuring that no parallel region of code, and no component, demands a specific number of threads in order to execute (a practice known as “mandatory” parallelism). The sketch below shows how over-subscription arises.
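As an illustration, here is a hedged sketch of the over-subscription pattern, assuming an MKL-enabled NumPy build on a machine with, say, eight cores:

from multiprocessing.pool import ThreadPool

import numpy as np

# Eight "mandatory" outer threads, each calling into oneMKL, which by
# default opens its own internal thread pool; the machine may end up
# scheduling far more software threads than it has cores.
mats = [np.random.random((512, 512)) for _ in range(8)]
with ThreadPool(8) as pool:
    pool.map(np.linalg.qr, mats)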

The alternative is to provide a kind of “optional” parallelism in which a work scheduler determines, at the user level, which thread or threads the components are mapped to, and automates the coordination of tasks across components and parallel regions. Naturally, because the scheduler arranges the program’s components and libraries over a single thread pool, its threading model must be at least as efficient as the libraries’ built-in high-performance threading; otherwise, efficiency is lost.

Intel’s Approach to Composability & Parallelism

Threading composability is easier to accomplish when oneTBB is used as the work scheduler. oneTBB is an open-source, cross-platform C++ library that enables multi-core parallel processing and was created with threading composability, optional parallelism, and nested parallelism in mind.

At the time of writing, the published oneTBB release included an experimental module that enables threading composability across multiple libraries, unlocking the potential for multithreaded performance gains in Python. As mentioned previously, the acceleration comes from the scheduler’s more efficient thread allocation.

The module replaces Python’s conventional ThreadPool with a oneTBB-based Pool class. By employing monkey patching (dynamically replacing or updating an object at runtime), it activates this thread pool across modules without requiring any code changes. Additionally, oneTBB replaces oneMKL’s threading layer with its own, which allows calls from the NumPy and SciPy libraries to participate in the same automated, composable parallelism.

The code samples from the following composability demo, run on a system with MKL-enabled NumPy, the TBB and symmetric multiprocessing (SMP) modules, and their accompanying IPython kernels installed, demonstrate how much nested parallelism can enhance performance. IPython offers a powerful command-shell interface for interactive computing in a variety of programming languages; the demo was run in a Jupyter Notebook to obtain a quantitative performance comparison.

import numpy as np
from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)

The preceding cell must be rerun each time the kernel is changed from the Jupyter menu in order to recreate the ThreadPool and reproduce the runtime results described below.

The following code, which is the same line run in each of the three trials, is used first with the default Python kernel:

%timeit pool.map(np.linalg.qr, [np.random.random((256, 256)) for i in range(10)])

With the default Python kernel, this measures how long the thread pool takes to compute the QR decomposition of ten random matrices. Runtime improves significantly, by up to an order of magnitude, when the python -m smp kernel is activated. The python -m tbb kernel offers an even more significant boost.
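To reproduce the comparison outside a notebook, the same benchmark can be written as a small standalone script (bench.py is a hypothetical filename; MKL-enabled NumPy and the smp/tbb modules are assumed) and launched under each mode:

# Run and compare wall times:
#   python bench.py
#   python -m smp bench.py
#   python -m tbb bench.py
import time

import numpy as np
from multiprocessing.pool import ThreadPool

pool = ThreadPool(10)
mats = [np.random.random((256, 256)) for _ in range(10)]

start = time.perf_counter()
for _ in range(100):
    pool.map(np.linalg.qr, mats)  # outer pool plus oneMKL's inner threads
print(f"100 iterations took {time.perf_counter() - start:.2f} s")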

oneTBB’s dynamic task scheduler delivers the best performance in this composability example because it most effectively manages code in which the innermost parallel regions cannot fully utilize the system’s CPU and in which the amount of work may vary. The SMP approach is still quite effective, but it tends to perform best when workloads are distributed more evenly and the loads of the outermost workers are comparable.

Conclusion

In conclusion, using multithreading helps speed up AI/ML workflows.

Python applications focused on AI and machine learning can be made more efficient in a variety of ways, and leveraging the capabilities of multithreading and multiprocessing will be one of the most important ways to push AI/ML software development workflows to their limits. Learn more about the unified, open, standards-based oneAPI programming model that serves as the cornerstone of Intel’s AI Software Portfolio, and take a look at Intel’s various AI tools and framework optimizations.