A Slight Case of Overthreading¶
We still encounter jobs on the HPC cluster that try to use all the cores on the node on which they're running, regardless of how many cores they requested, leading to node alarms. Sometimes, jobs try to use exactly twice or one-and-a-half times the allocated number of cores, or even that number squared. This was a little perplexing at first. In your enthusiasm to parallelize your code, make sure someone else hasn't already done so.
HPC users not using all the cores they request for their jobs is an obvious waste of resources that could be put to better use, especially during one of the cluster's frequent busy spells. However, as discussed in a previous blog post, using more cores than have been requested causes the wasteful overhead of constantly switching between threads, which can stop the node from being used by others. Another blog post gives users guidance on finding out what their jobs are doing.
Annoyingly, it's not always users' code that causes the problem, but the third-party software that they've chosen. In one case with a pleasingly happy ending, the Python code in the BRAKER genomics pipeline was not passing on the users' requested number of cores to external binary dependencies in all cases, sometimes causing overthreading. We were able to isolate the code responsible in a GitHub issue, allowing the original developers to supply a fix, to the benefit of all users of the software. This shows the importance of choosing open-source software that has recent commits in its repository and where there is a history of timely responses to users' issues.
Parallelization: Once is Enough¶
Another common problem is that developers don't always realise that some parts of popular Python libraries such as numpy, scipy, scikit-learn and PyTorch can use multiple cores by default. Trying to parallelize code that uses these with the multiprocessing standard library means it has been parallelized twice, and over-threading is almost certain. Let us conduct a small experiment. Start a qlogin session with four cores, load the Python module and install IPython and numpy in a virtualenv.
We're installing numpy<2.0, as the recent release of version 2.0 has significant breaking changes. Until all downstream packages that use numpy have been updated, you would be wise to pin the version in virtualenvs and conda/mamba environments; indeed, it's wise to do this with all packages.
qlogin -pe smp 4
module load python
python3 -m venv inv_exp
source inv_exp/bin/activate
pip install ipython 'numpy<2.0'
Finding the inverses of matrices would be a suitably computationally intensive task for numpy to do. For that, we need matrices that are invertible. Edit inv_mat.py with vi inv_mat.py:
import numpy as np


def invertible_matrix(n):
    """Return a random n x n matrix that is guaranteed to be invertible."""
    m = np.random.rand(n, n)
    # Put each row's absolute sum on the diagonal, making the matrix
    # diagonally dominant and hence invertible.
    mx = np.sum(np.abs(m), axis=1)
    np.fill_diagonal(m, mx)
    return m


def sum_inv_mat(n):
    """Create a random invertible n x n matrix, invert it and sum the entries."""
    return np.linalg.inv(invertible_matrix(n)).sum()
Start an ipython session and see how long it takes for np.linalg.inv to invert eight random 2048 by 2048 matrices:
import numpy as np
from inv_mat import invertible_matrix
mats = [invertible_matrix(2048) for _ in range(8)]
%timeit inverse = [np.linalg.inv(m) for m in mats]
ipython's %timeit "magic command" is handy here.
1.65 s ± 5.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy will have automatically used all four cores because the environment variable OMP_NUM_THREADS was set accordingly when we loaded the Python module. To prove it, set it to one and run the timing again:
OMP_NUM_THREADS=1 ipython
# in the new ipython session, recreate np, invertible_matrix and mats as before, then:
%timeit inverse = [np.linalg.inv(m) for m in mats]
Forcing numpy to use one core has made things worse by more than a factor of two:
3.84 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy was clearly making use of all four cores, with only the tacit encouragement of setting the OMP_NUM_THREADS variable. One might ponder whether this is entirely a good idea. OMP refers to OpenMP, a cross-platform standard for shared-memory multi-threading. The underlying BLAS and LAPACK libraries on which numpy is built all make use of it.
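If you want to check what the BLAS behind your particular numpy build will do, the optional threadpoolctl package (a pip install away, and not part of the experiment above) can report it. A rough sketch:

import numpy as np
from threadpoolctl import threadpool_info  # assumes: pip install threadpoolctl

np.show_config()  # which BLAS/LAPACK implementation numpy was built against
for pool in threadpool_info():
    # each entry describes one thread pool (e.g. OpenBLAS or OpenMP) and the
    # number of threads it is currently prepared to use
    print(pool["user_api"], pool["num_threads"])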
The documentation for the np.linalg module does mention this behaviour, and that the rest of numpy is single-threaded by default. Unfortunately, the documentation for individual functions like np.linalg.inv does not. Several cluster users have been caught out by this.
It's a good thing that casual users can have computationally demanding tasks accelerated with little effort because the Python module set the variable automatically. It's less good when users find computational resources used without them explicitly asking. Invoke a Python interpreter with python3 -c 'import this' to learn that "Explicit is better than implicit".
A process that takes all of a user's cores without asking, for example via a bare multiprocessing.Pool(), and causes their local machine to lock up to the point where they can't easily kill it, is rude. Doing so on a cluster node is positively disastrous.
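If you would rather be explicit in your own code than rely on whatever the environment happens to contain, the same optional threadpoolctl package provides a context manager. A minimal sketch, reusing the mats list from the session above:

from threadpoolctl import threadpool_limits  # assumes: pip install threadpoolctl
import numpy as np

# restrict the BLAS behind numpy to a single thread for this block only,
# whatever OMP_NUM_THREADS happens to be set to
with threadpool_limits(limits=1, user_api="blas"):
    inverse = [np.linalg.inv(m) for m in mats]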
No Cores for Alarm¶
What happens if we stay with OMP_NUM_THREADS equal to one and spread the inversion of the eight matrices across our four cores with multiprocessing? (Had we not set OMP_NUM_THREADS to one, we'd be trying to switch between sixteen threads with only four cores: numpy would be trying to split each matrix inversion across all four cores in all four processes.)
import os
import multiprocessing

# use the number of cores the scheduler granted us (NSLOTS), not a hard-coded value
n_cores = int(os.environ.get('NSLOTS', 1))
pool = multiprocessing.Pool(n_cores)
%timeit inverse = pool.map(np.linalg.inv, mats)
We cannot countenance passing a hard-coded value to multiprocessing.Pool; since we've messed with OMP_NUM_THREADS, we must fall back on NSLOTS still being equal to four. For our trouble, we don't quite get a 10% improvement over letting numpy use all four cores by itself:
1.52 s ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Should we take this modest improvement in performance? Not necessarily; in any case, we've made things a little hard for multiprocessing. Each Python process is sent a somewhat large numpy array to pass to np.linalg.inv, and it must send back an equally large array to the parent process. Imagine an even more contrived example, where somehow we can send and receive smaller amounts of data, with large matrices only existing as intermediate steps within each process. The function sum_inv_mat takes the size of a random invertible matrix to create, inverts it and returns the sum of its entries. What does this do to performance?
from inv_mat import sum_inv_mat
%timeit junk = pool.map(sum_inv_mat, [2048 for _ in range(8)])
Removing the overhead of sending and receiving a large matrix to and from each process gives a further 30% improvement, even though each one now has the extra job of creating the invertible matrix in the first place and then adding all the entries:
1.14 s ± 8.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is a more useful 40% improvement over just letting numpy get on with things. It remains to be seen how this would vary with the number and size of the matrices, and the number of cores used. To find out what really happened, we'd need to use a profiler like the cProfile module, as discussed in the previous blog post.
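As a rough illustration, cProfile can be run from within the same ipython session, reusing the pool and sum_inv_mat objects from above. Note that it only sees the parent process, so the work done inside the workers shows up as time spent waiting in pool.map:

import cProfile
import pstats

# profile one run of the pool-based version and dump the statistics to a file
cProfile.run('pool.map(sum_inv_mat, [2048 for _ in range(8)])', 'inv.prof')

# show the ten most expensive calls by cumulative time
pstats.Stats('inv.prof').sort_stats('cumulative').print_stats(10)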
Sending data to each process at the beginning and receiving results at the end, with no other inter-process communication, is about as simple as parallelization can get. Such cases are said to be embarrassingly, or perhaps pleasingly, parallel. Parallelization where there needs to be inter-process communication throughout is far more complex, often with recourse to the Message Passing Interface (MPI), for example via the Open MPI implementation, which despite its name is entirely distinct from OpenMP. (Yes, there are two key parallelization technologies in HPC whose names both start with "OpenMP".)
If we can avoid sending large matrices to and from multiprocessing, should we use this approach? Hypothetically, it could make the difference between a job fitting in Apocrita's ten-day limit or not. However, there are some drawbacks. We have just created for ourselves the task of informing the end user that they should not use the common idiom of OMP_NUM_THREADS, and then making sure the desired number of cores is passed to each usage of multiprocessing via libraries like argparse or click. The earlier anecdote about the genomics pipeline shows that this process isn't entirely fool-proof. The BioCiphers package Moccasin uses np.linalg.inv inside multiprocessing, doesn't warn users about OMP_NUM_THREADS, and consequently sets them up to cause node alarms.
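To give a flavour of that plumbing, here is a sketch of a small script built around argparse, reusing sum_inv_mat from inv_mat.py; the --cores option name is just an illustration:

import argparse
import multiprocessing
import os

from inv_mat import sum_inv_mat


def main():
    # Default the worker count to the scheduler's NSLOTS so the job uses
    # exactly the cores it requested, while still allowing an override.
    parser = argparse.ArgumentParser()
    parser.add_argument("--cores", type=int,
                        default=int(os.environ.get("NSLOTS", 1)),
                        help="number of worker processes to use")
    args = parser.parse_args()

    with multiprocessing.Pool(args.cores) as pool:
        print(pool.map(sum_inv_mat, [2048] * 8))


if __name__ == "__main__":
    main()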
If passing large objects to Pool.map can't easily be avoided, the overhead might be reduced by using shared_memory.
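As a rough sketch of the idea for a single matrix, using the standard library's multiprocessing.shared_memory (Python 3.8 onwards) and a hypothetical sum_inv_shared helper that a worker process could run:

import numpy as np
from multiprocessing import shared_memory

from inv_mat import invertible_matrix

# Place one matrix in a shared memory block so that worker processes can
# attach to it by name rather than receiving a pickled copy of the data.
mat = invertible_matrix(2048)
shm = shared_memory.SharedMemory(create=True, size=mat.nbytes)
shared = np.ndarray(mat.shape, dtype=mat.dtype, buffer=shm.buf)
shared[:] = mat[:]   # one-off copy into the shared block


def sum_inv_shared(name, shape, dtype):
    # Runs in a worker: attach to the existing block by its name.
    existing = shared_memory.SharedMemory(name=name)
    m = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
    result = np.linalg.inv(m).sum()
    existing.close()
    return result


# ... pass (shm.name, mat.shape, mat.dtype) to the workers, then tidy up
# in the parent with shm.close() and shm.unlink().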
PyTorch has its own multiprocessing wrapper that uses shared memory, but it's not without caveats. Remember to factor in the cost of more complex optimized code being harder to maintain and making it harder to onboard new contributors. The book High Performance Python has plenty of good advice.
Final Thoughts¶
- If you parallelize code without checking to see if it's already parallel, you will overthread, and you will cause node alarms.
- If your code disrupts the HPC idiom of OMP_NUM_THREADS, protecting the user (usually you) is your problem, until you make it the problem of the research applications team and everyone else using the cluster.
- Optimizing code can be interesting and rewarding, but always benchmark and profile to make sure it's worth it.