Limiting Parallelism in scikit-learn
Scikit-learn uses three levels of parallelism by default:
- Multiple processes with joblib
- Threads with OpenMP
- Threads in the BLAS routines used by NumPy and SciPy
I’ve recently been running into issues with KMeans clustering on my System76 Thelio Astra workstation, which has 128 cores. Primarily, the Python interpreter would segfault. Although the Jupyter web interface didn’t provide any hints, the Jupyter notebook console printed the following error message:
```
OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.
To avoid this warning, please rebuild your copy of OpenBLAS with a larger NUM_THREADS setting
or set the environment variable OPENBLAS_NUM_THREADS to 64 or lower
```
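The warning itself points at the quick fix: cap OpenBLAS’s thread count before NumPy first loads the library. Here is a minimal sketch for a standalone script; the cap of 4 is an arbitrary illustration, and the variable must be set before the first import because OpenBLAS reads it when its shared library is loaded:

```python
import os

# Must run before numpy/scipy are imported: OpenBLAS reads the
# variable once, when its shared library is first loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "4"  # illustrative cap

import numpy as np  # OpenBLAS now starts with at most 4 threads
```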
The KMeans class used to have a parameter `n_jobs` that allowed the user to set the number of processes launched. This parameter was deprecated in version 0.23 and removed in version 1.0 (the release originally planned as 0.25). A new parallelization scheme using OpenMP was implemented to provide finer-grained parallelism and improve performance. In effect, the total number of threads may be as high as `OMP_NUM_THREADS * OPENBLAS_NUM_THREADS`. By default, the new KMeans parallelization will use one OpenMP thread per core. Each OpenMP thread may use NumPy or SciPy routines backed by a BLAS library such as OpenBLAS, which in turn spawns its own threads.
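To make that concrete with the numbers from the discovery output shown below: 128 OpenMP threads (one per core on my machine), each potentially driving an OpenBLAS pool of 32 threads, works out to a worst case of 128 × 32 = 4,096 threads.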
When I converted the notebook into a standalone script, the Python interpreter output the following error message before crashing:
```
double free or corruption (out)
Aborted (core dumped)
```
I was looking for a way to reduce the number of OpenBLAS threads. The `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` environment variables are one mechanism, but they are not straightforward to use with Jupyter notebooks and must be set separately for each notebook. The scikit-learn documentation suggests using the threadpoolctl library to control the number of parallel units at each level. Using threadpoolctl, I was not only able to reduce the number of OpenBLAS threads to avoid the over-allocation but also to work around the interpreter crashes.
Unfortunately, neither the scikit-learn nor the threadpoolctl documentation is particularly clear on how to use it. I’ve attempted to describe how to use the functions of the threadpoolctl library with a little more context.
- Import your libraries of interest so that the parallelization libraries they link against are loaded and can be discovered:

```python
import mdtraj
import numpy as np
import scipy as sp
import sklearn
```
- Discover the parallelization libraries being used by running:
```python
from threadpoolctl import threadpool_info
from pprint import pprint

pprint(threadpool_info())
```
which will output something like:
```
[{'architecture': 'neoversen1',
  'filepath': '/home/rnowling/openmm-venv/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-0f683016.so',
  'internal_api': 'openblas',
  'num_threads': 32,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.27'},
 {'filepath': '/home/rnowling/openmm-venv/lib/python3.12/site-packages/mdtraj.libs/libgomp-d22c30c5.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 128,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'architecture': 'neoversen1',
  'filepath': '/home/rnowling/openmm-venv/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-9778f98e.so',
  'internal_api': 'openblas',
  'num_threads': 32,
  'prefix': 'libscipy_openblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]
```
- Use threadpoolctl’s context managers to limit the number of threads at each level:

```python
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=8, user_api="openmp"):
    with threadpool_limits(limits=4, user_api="blas"):
        ...
```
This code snippet will limit OpenMP to spawning 8 threads and OpenBLAS to spawning 4 threads.
The `threadpool_limits` context manager parameters are interpreted as follows:
- `limits`: the maximum number of parallel execution units (e.g., threads)
- `user_api`: the name of a library type, given in the `user_api` field of the discovery output
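Putting the pieces together, here is a minimal end-to-end sketch; the data is synthetic and the limits of 8 and 4 are the illustrative values from the snippet above, not tuned recommendations:

```python
import numpy as np
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
X = rng.random((10_000, 50))  # synthetic stand-in for real data

# Cap OpenMP at 8 threads and the BLAS pools at 4 threads while fitting.
with threadpool_limits(limits=8, user_api="openmp"):
    with threadpool_limits(limits=4, user_api="blas"):
        model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

print(model.inertia_)
```

The threadpoolctl documentation also describes passing a dict to `limits` (e.g., `limits={"openmp": 8, "blas": 4}`), which sets both caps with a single context manager.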