A Slight Case of Overthreading¶
We still encounter jobs on the HPC cluster that try to use all the cores on the node on which they're running, regardless of how many cores they requested, leading to node alarms. Sometimes, jobs try to use exactly twice or one-and-a-half times the allocated number of cores, or even that number squared. This was a little perplexing at first. In your enthusiasm to parallelize your code, make sure someone else hasn't already done so.
HPC users not using all the cores they request for their jobs is an obvious waste of resources that could be put to better use, especially during one of the cluster's frequent busy spells. However, as discussed in a previous blog post, using more cores than have been requested causes the wasteful overhead of constantly switching between threads, which can stop the node from being used by others. Another blog post gives users guidance on finding out what their jobs are doing.
Annoyingly, it's not always users' code that causes the problem, but the third-party software that they've chosen. In one case with a pleasingly happy ending, the Python code in the BRAKER genomics pipeline was not passing on the users' requested number of cores to external binary dependencies in all cases, sometimes causing overthreading. We were able to isolate the code responsible in a GitHub issue, allowing the original developers to supply a fix, to the benefit of all users of the software. This shows the importance of choosing open-source software that has recent commits in its repository and where there is a history of timely responses to users' issues.
Parallelization: Once is Enough¶
Another common problem is that developers don't always realise that some parts of popular Python libraries such as numpy, scipy, scikit-learn and PyTorch can use multiple cores by default. Trying to parallelize code that uses these with the multiprocessing standard library means it has been parallelized twice, and over-threading is almost certain. Let us conduct a small experiment. Start a qlogin session with four cores, load the Python module and install IPython and numpy in a virtualenv.
We're installing numpy<2.0, as the recent release of version 2.0 has significant breaking changes. Until all downstream packages that use numpy have been updated, you would be wise to pin the version in virtualenvs and conda/mamba environments; indeed, it's wise to do this with all packages.
qlogin -pe smp 4
module load python
python3 -m venv inv_exp
source inv_exp/bin/activate
pip install ipython 'numpy<2.0'
Finding the inverses of matrices would be a suitably computationally intensive task for numpy to do. For that, we need matrices that are invertible. Edit inv_mat.py with vi inv_mat.py:
import numpy as np


def invertible_matrix(n):
    """Return a random n x n matrix that is guaranteed to be invertible."""
    m = np.random.rand(n, n)
    # Put each row's absolute sum on the diagonal, making the matrix
    # diagonally dominant and hence invertible.
    mx = np.sum(np.abs(m), axis=1)
    np.fill_diagonal(m, mx)
    return m


def sum_inv_mat(n):
    """Create a random invertible n x n matrix, invert it and sum the entries."""
    return np.linalg.inv(invertible_matrix(n)).sum()
Start an ipython session and see how long it takes for np.linalg.inv to invert eight random 2048 by 2048 matrices:
import numpy as np
from inv_mat import invertible_matrix
mats = [invertible_matrix(2048) for _ in range(8)]
%timeit inverse = [np.linalg.inv(m) for m in mats]
ipython's %timeit "magic command" is handy here.
1.65 s ± 5.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy will have automatically used all four cores because the environment variable OMP_NUM_THREADS was set accordingly when we loaded the Python module. To prove it, set it to one and run the timing again:
OMP_NUM_THREADS=1 ipython
# in the new ipython session, recreate np, invertible_matrix and mats as before, then:
%timeit inverse = [np.linalg.inv(m) for m in mats]
Forcing numpy to use one core has made things worse by more than a factor of two:
3.84 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy was clearly making use of all four cores, with only the tacit encouragement of setting the OMP_NUM_THREADS variable. One might ponder whether this is entirely a good idea. OMP refers to OpenMP, a cross-platform standard for shared-memory multi-threading. The underlying BLAS and LAPACK libraries on which numpy is built all make use of it.
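If you want to check what the BLAS behind your particular numpy build will do, the optional threadpoolctl package (a pip install away, and not part of the experiment above) can report it. A rough sketch:

import numpy as np
from threadpoolctl import threadpool_info  # assumes: pip install threadpoolctl

np.show_config()  # which BLAS/LAPACK implementation numpy was built against
for pool in threadpool_info():
    # each entry describes one thread pool (e.g. OpenBLAS or OpenMP) and the
    # number of threads it is currently prepared to use
    print(pool["user_api"], pool["num_threads"])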
The documentation for the np.linalg module does mention this behaviour, and that the rest of numpy is single-threaded by default. Unfortunately, the documentation for individual functions like np.linalg.inv does not. Several cluster users have been caught out by this.
It's a good thing that casual users can have computationally demanding tasks accelerated with little effort because the Python module set the variable automatically. It's less good when users find computational resources used without them explicitly asking. Invoke a Python interpreter with python3 -c 'import this' to learn that "Explicit is better than implicit".
A process that takes all of a user's cores without asking, for example via a bare multiprocessing.Pool(), and causes their local machine to lock up to the point where they can't easily kill it, is rude. Doing so on a cluster node is positively disastrous.
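If you would rather be explicit in your own code than rely on whatever the environment happens to contain, the same optional threadpoolctl package provides a context manager. A minimal sketch, reusing the mats list from the session above:

from threadpoolctl import threadpool_limits  # assumes: pip install threadpoolctl
import numpy as np

# restrict the BLAS behind numpy to a single thread for this block only,
# whatever OMP_NUM_THREADS happens to be set to
with threadpool_limits(limits=1, user_api="blas"):
    inverse = [np.linalg.inv(m) for m in mats]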
No Cores for Alarm¶
What happens if we stay with OMP_NUM_THREADS equal to one and spread the inversion of the eight matrices across our four cores with multiprocessing? (Had we not set OMP_NUM_THREADS to one, we'd be trying to switch between sixteen threads with only four cores: numpy would be trying to split each matrix inversion across all four cores in all four processes.)
import os
import multiprocessing

# use the number of cores the scheduler granted us (NSLOTS), not a hard-coded value
n_cores = int(os.environ.get('NSLOTS', 1))
pool = multiprocessing.Pool(n_cores)
%timeit inverse = pool.map(np.linalg.inv, mats)
We cannot countenance passing a hard-coded value to multiprocessing.Pool; since we've messed with OMP_NUM_THREADS, we must fall back on NSLOTS still being equal to four. For our trouble, we don't quite get a 10% improvement over letting numpy use all four cores by itself:
1.52 s ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Should we take this modest improvement in performance? Not necessarily; in any case, we've made things a little hard for multiprocessing. Each Python process is sent a somewhat large numpy array to pass to np.linalg.inv, and it must send back an equally large array to the parent process. Imagine an even more contrived example, where somehow we can send and receive smaller amounts of data, with large matrices only existing as intermediate steps within each process. The function sum_inv_mat takes the size of a random invertible matrix to create, inverts it and returns the sum of its entries. What does this do to performance?
from inv_mat import sum_inv_mat
%timeit junk = pool.map(sum_inv_mat, [2048 for _ in range(8)])
Removing the overhead of sending and receiving a large matrix to and from each process gives a further 30% improvement, even though each one now has the extra job of creating the invertible matrix in the first place and then adding all the entries:
1.14 s ± 8.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is a more useful 40% improvement over just letting numpy get on with things. It remains to be seen how this would vary with the number and size of the matrices, and the number of cores used. To find out what really happened, we'd need to use a profiler like the cProfile module, as discussed in the previous blog post.
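As a rough illustration, cProfile can be run from within the same ipython session, reusing the pool and sum_inv_mat objects from above. Note that it only sees the parent process, so the work done inside the workers shows up as time spent waiting in pool.map:

import cProfile
import pstats

# profile one run of the pool-based version and dump the statistics to a file
cProfile.run('pool.map(sum_inv_mat, [2048 for _ in range(8)])', 'inv.prof')

# show the ten most expensive calls by cumulative time
pstats.Stats('inv.prof').sort_stats('cumulative').print_stats(10)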
Sending data to each process at the beginning and receiving results at the end, with no other inter-process communication, is about as simple as parallelization can get. Such cases are said to be embarrassingly, or perhaps pleasingly, parallel. Parallelization where there needs to be inter-process communication throughout is far more complex, often with recourse to the Message Passing Interface (MPI), for example via the Open MPI implementation, which despite its name is entirely distinct from OpenMP. (Yes, there are two key parallelization technologies in HPC whose names both start with "OpenMP".)
If we can avoid sending large matrices to and from multiprocessing, should we use this approach? Hypothetically, it could make the difference between a job fitting in Apocrita's ten-day limit or not. However, there are some drawbacks. We have just created for ourselves the task of informing the end user that they should not use the common idiom of OMP_NUM_THREADS, and then making sure the desired number of cores is passed to each usage of multiprocessing via libraries like argparse or click. The earlier anecdote about the genomics pipeline shows that this process isn't entirely fool-proof. The BioCiphers package Moccasin uses np.linalg.inv inside multiprocessing, doesn't warn users about OMP_NUM_THREADS, and consequently sets them up to cause node alarms.
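To give a flavour of that plumbing, here is a sketch of a small script built around argparse, reusing sum_inv_mat from inv_mat.py; the --cores option name is just an illustration:

import argparse
import multiprocessing
import os

from inv_mat import sum_inv_mat


def main():
    # Default the worker count to the scheduler's NSLOTS so the job uses
    # exactly the cores it requested, while still allowing an override.
    parser = argparse.ArgumentParser()
    parser.add_argument("--cores", type=int,
                        default=int(os.environ.get("NSLOTS", 1)),
                        help="number of worker processes to use")
    args = parser.parse_args()

    with multiprocessing.Pool(args.cores) as pool:
        print(pool.map(sum_inv_mat, [2048] * 8))


if __name__ == "__main__":
    main()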
If passing large objects to Pool.map can't easily be avoided, the overhead might be reduced by using shared_memory.
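As a rough sketch of the idea for a single matrix, using the standard library's multiprocessing.shared_memory (Python 3.8 onwards) and a hypothetical sum_inv_shared helper that a worker process could run:

import numpy as np
from multiprocessing import shared_memory

from inv_mat import invertible_matrix

# Place one matrix in a shared memory block so that worker processes can
# attach to it by name rather than receiving a pickled copy of the data.
mat = invertible_matrix(2048)
shm = shared_memory.SharedMemory(create=True, size=mat.nbytes)
shared = np.ndarray(mat.shape, dtype=mat.dtype, buffer=shm.buf)
shared[:] = mat[:]   # one-off copy into the shared block


def sum_inv_shared(name, shape, dtype):
    # Runs in a worker: attach to the existing block by its name.
    existing = shared_memory.SharedMemory(name=name)
    m = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
    result = np.linalg.inv(m).sum()
    existing.close()
    return result


# ... pass (shm.name, mat.shape, mat.dtype) to the workers, then tidy up
# in the parent with shm.close() and shm.unlink().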
PyTorch has its own multiprocessing wrapper that uses shared memory, but it's not without caveats. Remember to factor in the cost of more complex optimized code being harder to maintain and making it harder to onboard new contributors. The book High Performance Python has plenty of good advice.
Final Thoughts¶
- If you parallelize code without checking to see if it's already parallel, you will overthread, and you will cause node alarms.
- If your code disrupts the HPC idiom of OMP_NUM_THREADS, protecting the user (usually you) is your problem, until you make it the problem of the research applications team and everyone else using the cluster.
- Optimizing code can be interesting and rewarding, but always benchmark and profile to make sure it's worth it.