Running Machine Learning workloads on Apocrita¶
In this tutorial we'll be showing you how to run a TensorFlow job using the GPU nodes on the Apocrita HPC cluster. We will expand upon the essentials provided on the QMUL HPC docs site, and provide more explanation of the process. We'll start with software installation before demonstrating a simple task and a more complex real-world example that you can adapt for your own jobs, along with tips on how to check if the GPU is being used.
Available hardware¶
GPU cards can provide huge acceleration to certain workloads, particularly in the field of Machine Learning.
The QMUL Apocrita HPC cluster has the following GPU-enabled nodes:
- 4 nxg nodes with NVIDIA Kepler K80 (effectively dual K40) cards.
- 3 sbg nodes with 4 x NVIDIA Volta V100 cards each.
- 1 sbg node with 4 x NVIDIA Ampere A100 cards.
- 16 xdg nodes with 4 x NVIDIA Ampere A100 cards (access for DERI Andrena cluster users only).
Installation¶
Using pip and virtualenv¶
TensorFlow for GPU is provided as a compiled package for the pip and conda environments, and hence can be installed by the user. For simplicity we will focus on the pip method. The TensorFlow instructions for pip and conda are also provided on the Apocrita HPC documentation site.
The procedure follows the standard method for virtual environments on a shared system. Virtual environments allow us to install different collections of Python packages without experiencing conflicts or versioning issues.
Loading applications using the module command¶
Running module avail python will show the available Python versions; module load python without a version number will load the default version into the current session, and will also provide the pip and virtualenv commands. On Apocrita, the default python module is a recent Python 3 version, as shown below:
$ module avail python
----------- /share/apps/environmentmodules/centos7/general ---------------
python/2.7.15 python/3.6.3 python/3.8.5(default)
$ module load python
$ module list
Currently Loaded Modulefiles:
1) python/3.8.5(default)
Use Python 3 instead of Python 2
The Python project announced that Python 2 would not receive any updates, including security updates, after January 1, 2020, so you should ensure that your code is Python 3 compliant.
Installing TensorFlow GPU package in a virtual environment¶
We will now demonstrate how to install the TensorFlow GPU package, using the following steps:
- load the python module
- set up a new virtual environment in your home directory (we will use tensorgpu in this example)
- activate the tensorgpu virtual environment
- install the TensorFlow package into the active environment
module load python
virtualenv ~/tensorgpu
source ~/tensorgpu/bin/activate
pip install tensorflow
For releases 1.15 and older, CPU and GPU packages are separate:
pip install tensorflow==1.15 # CPU
pip install tensorflow-gpu==1.15 # GPU
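To confirm that the installed package was built with GPU support, a quick check from inside the activated virtualenv might look like this (tf.test.is_built_with_cuda() returns True for CUDA-enabled builds):

import tensorflow as tf
print(tf.__version__)
# True if this TensorFlow build was compiled with CUDA (GPU) support
print(tf.test.is_built_with_cuda())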
Any TensorFlow dependencies will be installed at the same time. Notice that the session prompt becomes prefixed with the name of the currently activated virtualenv, as a handy visual reminder. You can deactivate the current virtualenv with the deactivate command.
Now we have a virtual environment which can be loaded again on demand. To do so in a new session or job script, load the python module and source the virtualenv. Ensure you load the same python module that was used to create the virtualenv, to benefit from thread optimisation and shared library support.
module load python
source ~/tensorgpu/bin/activate
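To confirm that the re-activated environment is the one Python will actually use, you can check which interpreter it resolves to; inside the virtualenv this should print a path under ~/tensorgpu:

import sys
# Inside the activated virtualenv this points at the virtualenv's interpreter
print(sys.executable)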
While in an activated environment, running the pip freeze command will show the installed packages and their version numbers. It's good practice to keep a copy of this output (for example, with pip freeze > requirements.txt) in case you need to re-create the environment in future.
Installing packages in a virtualenv only needs to be done once
A common mistake made by new users is to include the virtualenv creation and pip install commands in their job script. However, after the correct packages have been installed, all that is required to use them is to activate the virtualenv (from within your job script, etc.).
TensorFlow and CUDA/CUDNN library versions¶
The GPU version of TensorFlow requires the CUDA and CUDNN modules to be loaded in the environment. Loading the correct CUDNN module will load the accompanying CUDA version as a dependency. Loading the wrong CUDA/CUDNN module for your TensorFlow version will result in errors at runtime and a fallback to CPU-only mode.
| TensorFlow version | CUDNN version | CUDA version |
|---|---|---|
| 2.4 - 2.6 | 8.1 | 11 |
| 2.1 - 2.3 | 7.6 | 10.1 |
| 1.13.1 - 2.0 | 7.4 | 10 |
| 1.5 - 1.12 | 7 | 9 |
| <1.4 | 6 | 8 |
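If you are unsure which CUDA/CUDNN versions your installed TensorFlow was built against, recent releases (roughly 2.3 onwards) expose the build information at runtime; a quick check might look like:

import tensorflow as tf
# On GPU-enabled builds the dict includes 'cuda_version' and 'cudnn_version'
info = tf.sysconfig.get_build_info()
print(info['cuda_version'], info['cudnn_version'])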
Installing a specific version of a package¶
Instead of installing the latest package, you may require a specific version for compatibility reasons. For example, pip install tensorflow-gpu==1.15 will install that exact version, if it is available.
Bulk install of packages using a requirements file¶
A requirements file, in the format produced by pip freeze, will install all listed packages via pip install -r requirements.txt, in a rapid and reproducible manner.
For example, given a set of required packages for your job, make a requirements.txt file containing the packages (and version numbers as necessary). The following list is just an example of what that might look like:
Keras==2.4.3
matplotlib==3.4.1
pandas==1.2.4
sklearn
tensorflow==2.4.1
Create a fresh environment (which we will call myenv) and install the packages:
module load python
virtualenv myenv
source myenv/bin/activate
pip install -r requirements.txt
Additional dependencies will be pulled in as required. As a preferred approach, supply the whole output of pip freeze from a known-good virtualenv you have set up previously, which will also include the dependencies.
Running a simple job¶
All work must be submitted via the job scheduler to ensure optimal and fair use of resources. This basic job will check that you can access a GPU node, load your environment, run TensorFlow and output the TensorFlow version. Before running a GPU job, you need to request addition to the GPU user access list, providing an example of a typical job script you will be running; this helps us avoid situations where a user runs a lot of jobs that request GPU resources but don't use them.
In a text editor, create the file basic.qsub. Note that it's best to create and edit files using a text editor on the HPC system, such as vim, nano or emacs, rather than creating them on your local workstation. This avoids a common issue with Windows control characters, and also ensures a more streamlined workflow.
#!/bin/bash
#$ -cwd
#$ -j y # Merge output and error files (optional)
#$ -pe smp 8 # Request cores (8 per GPU)
#$ -l h_vmem=7.5G # Request RAM (7.5GB per core)
#$ -l h_rt=240:0:0 # Request maximum runtime (10 days)
#$ -l gpu=1 # Request 1 GPU
#$ -N basicGPU # Name for the job (optional)
# Load the necessary modules
module load python
module load cudnn/8.1.1-cuda11.2
# Load the virtualenv containing the tensorflow package
source ~/tensorgpu/bin/activate
# Report the TensorFlow version
python -c 'import tensorflow as tf; print(tf.__version__)'
Running qsub basic.qsub will tell the scheduler to add the job to the queue. You can verify this with the qstat command. Note that, while usually the rules about resource requests are very strict (request only what you will use), the convention is to request 8 cores per GPU.
If there are free resources, the job will run immediately and produce an output file a few seconds later containing the results of the job. See this page for an explanation of job output filenames, and here for more detail on using GPU nodes.
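Printing the version only proves that TensorFlow imported successfully. As a slightly stronger check, you could ask TensorFlow to list the GPU devices it can see; an empty list means the job has fallen back to the CPU:

import tensorflow as tf
# Empty output here means TensorFlow cannot see a GPU
print(tf.config.list_physical_devices('GPU'))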
Running a real-life job¶
The prerequisite for this job is a TensorFlow virtualenv and a copy of the mnist_classify.py code shown below [1], which is also available on GitHub.
import tensorflow as tf

# Load the MNIST dataset and scale pixel values to the range [0, 1]
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A simple fully-connected network for classifying 28x28 digit images
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')])

# The final layer applies softmax, so the loss receives probabilities
# rather than logits (hence from_logits=False)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10)
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print('\nTest accuracy:', test_acc)
Prepare mnist_classify.qsub:
#!/bin/bash
#$ -cwd
#$ -j y # Merge output and error files (optional)
#$ -pe smp 8 # Request cores (8 per GPU)
#$ -l h_vmem=7.5G # Request RAM (7.5GB per core)
#$ -l h_rt=240:0:0 # Request maximum runtime (10 days)
#$ -m bea # Send email on begin,end,abort (optional)
#$ -l gpu=1 # Request 1 GPU
#$ -N mnist_classify # Name for the job (optional)
# Load necessary modules
module load python
module load cudnn/8.1.1-cuda11.2
# Load the virtualenv containing the tensorflow package
source ~/tensorgpu/bin/activate
# Run the mnist_classify.py code
python mnist_classify.py
Since GPUs are the primary resource on these nodes, we ask users to standardise their CPU and RAM requests on GPU nodes, so that non-GPU resources are shared evenly between GPU devices without too much effort. This equates to 8 cores and 7.5GB RAM per core, for each GPU requested.
Submit the job with qsub mnist_classify.qsub and check the status of your queued and running jobs with qstat.
$ qsub mnist_classify.qsub
Your job 630581 ("mnist_classify") has been submitted
$ qstat
job-ID prior name user state submit/start at queue slots
----------------------------------------------------------------------------------
630581 15.00646 mnist_classify abc123 r 03/22/2021 09:57:00 all.q@nxg1 8
We have added -m bea in the job script to send a notification email when the job begins, ends or aborts.
Checking the progress of your job¶
If your job starts immediately, you can ssh to the node and run nvidia-smi to check the GPU device activity and attached processes. Your process will be a python process. Note that another user will likely be using one of the other GPUs, and their process may also be python. The first few lines of the job output file mnist_classify.o<jobid> will mention a GPU device being used (note that it might state GPU 0 even when using another GPU device, because only the GPUs you have requested are visible to you, starting at GPU 0).
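This restriction is typically applied through the CUDA_VISIBLE_DEVICES environment variable; if you want to confirm the mapping from inside your own code, a small sketch might look like:

import os
import tensorflow as tf
# The scheduler normally sets this to the GPUs allocated to your job
print(os.environ.get('CUDA_VISIBLE_DEVICES'))
print(tf.config.list_physical_devices('GPU'))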
+-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:83:00.0 Off | 0 |
| N/A 46C P0 113W / 149W | 10843MiB / 11441MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:84:00.0 Off | 0 |
| N/A 70C P0 135W / 149W | 767MiB / 11441MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 17053 C python 10843MiB |
| 1 N/A N/A 18998 C python 756MiB |
+-----------------------------------------------------------------------------+
One of the ways to check which of the tasks is yours is to use the ps command and search for the process IDs attached to each GPU. For example:
$ ps -f 17053 18998
UID PID PPID C STIME TTY STAT TIME CMD
abc123 17053 13708 99 16:16 ? Sl 55:13 python /data/home/abc123/.../mnist_classify.py
xyz987 18998 19483 99 10:45 ? Sl 675:27 python tuning.py
In this case, process ID 17053 is owned by user abc123 and is using GPU 0; at this particular moment it is consuming 10843MiB of GPU RAM and 100% GPU utilisation. The GPU usage may fluctuate over the course of the job, but consistently low figures may be an indication that some settings could be tweaked to gain better performance.
We have confirmed that the job is using a GPU, and we will now inspect the job output file. The file is created in the directory from which the job was submitted, and its name is a concatenation of the job name and the job ID number. If no job name is provided in the job script, the file name of the script is used instead.
In this example, the job runs for around 1-2 minutes and the output file is mnist_classify.o630581, which we can inspect using less mnist_classify.o630581.
We've truncated parts of the output, but it is important to check the CUDA library messages that appear, to ensure the code is being run on the GPU. If there are any missing libraries or error messages, the code might not run as expected, or may silently continue to run on the CPU.
Variable OMP_NUM_THREADS has been set to 8
Loading cudnn/8.1.1-cuda11.2
Loading requirement: cuda/11.2.2
[cuda library messages]
Epoch 1/10
32/1875 [..............................] - ETA: 9s - loss: 1.9411 - accuracy: 0.3788
149/1875 [=>............................] - ETA: 4s - loss: 1.2637 - accuracy: 0.6238
300/1875 [===>..........................] - ETA: 2s - loss: 0.9795 - accuracy: 0.7124
452/1875 [======>.......................] - ETA: 2s - loss: 0.8372 - accuracy: 0.7553
604/1875 [========>.....................] - ETA: 2s - loss: 0.7483 - accuracy: 0.7818
753/1875 [===========>..................] - ETA: 1s - loss: 0.6869 - accuracy: 0.7998
897/1875 [=============>................] - ETA: 1s - loss: 0.6419 - accuracy: 0.8130
1043/1875 [===============>..............] - ETA: 1s - loss: 0.6055 - accuracy: 0.8237
1191/1875 [==================>...........] - ETA: 1s - loss: 0.5753 - accuracy: 0.8305
1341/1875 [====================>.........] - ETA: 0s - loss: 0.5496 - accuracy: 0.8383
1493/1875 [======================>.......] - ETA: 0s - loss: 0.5274 - accuracy: 0.8451
1639/1875 [=========================>....] - ETA: 0s - loss: 0.5133 - accuracy: 0.8508
1786/1875 [===========================>..] - ETA: 0s - loss: 0.4963 - accuracy: 0.8557
1875/1875 [==============================] - 6s 1ms/step - loss: 0.4833 - accuracy: 0.8595
...
Epoch 10/10
103/1875 [>.............................] - ETA: 2s - loss: 0.0380 - accuracy: 0.9891
252/1875 [===>..........................] - ETA: 2s - loss: 0.0377 - accuracy: 0.9891
402/1875 [=====>........................] - ETA: 2s - loss: 0.0393 - accuracy: 0.9885
553/1875 [=======>......................] - ETA: 1s - loss: 0.0397 - accuracy: 0.9882
704/1875 [==========>...................] - ETA: 1s - loss: 0.0403 - accuracy: 0.9879
855/1875 [============>.................] - ETA: 1s - loss: 0.0406 - accuracy: 0.9877
990/1875 [==============>...............] - ETA: 1s - loss: 0.0408 - accuracy: 0.9875
1142/1875 [=================>............] - ETA: 1s - loss: 0.0412 - accuracy: 0.9873
1290/1875 [===================>..........] - ETA: 0s - loss: 0.0415 - accuracy: 0.9872
1437/1875 [=====================>........] - ETA: 0s - loss: 0.0417 - accuracy: 0.9870
1586/1875 [========================>.....] - ETA: 0s - loss: 0.0418 - accuracy: 0.9870
1732/1875 [==========================>...] - ETA: 0s - loss: 0.0419 - accuracy: 0.9869
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0420 - accuracy: 0.9868
313/313 - 0s - loss: 0.0700 - accuracy: 0.9803
Test accuracy: 0.9803000092506409
We can see that the job initialised with a GPU device and is progressing. It's important to inspect this information, as a badly configured job may not utilise the GPU at all, resulting in very poor performance and blocking a GPU from use by another researcher.
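If you would like TensorFlow itself to report where each operation runs, you can enable device placement logging near the top of your script; note this is verbose and best reserved for debugging:

import tensorflow as tf
# Logs the device (CPU or GPU) that each operation is placed on
tf.debugging.set_log_device_placement(True)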
To check the latest progress of the job, you can use tail -f <filename> to show the end of the file and continue to output data as the file grows.
Use of multiple GPUs¶
If your code supports it, you may request more than one GPU for your job. Note that requesting 2 GPUs does not automatically mean that both GPUs will be used, so it's good practice to check nvidia-smi each time you try new software.
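TensorFlow will not spread work across GPUs automatically. For Keras models, one common approach is tf.distribute.MirroredStrategy, which replicates the model on every visible GPU and splits each batch between them; a minimal sketch, adapting the MNIST model from earlier:

import tensorflow as tf

# Replicate the model across all GPUs visible to the job
strategy = tf.distribute.MirroredStrategy()
print('Number of devices:', strategy.num_replicas_in_sync)

# The model must be built and compiled inside the strategy scope
with strategy.scope():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  metrics=['accuracy'])

# model.fit(...) is then called as usual and batches are split across GPUs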
Other Machine Learning applications¶
We've worked through a detailed approach for running TensorFlow jobs, which can largely be applied to other frameworks such as PyTorch, which are also available via pip and conda. Some packages involve additional dependencies which may not be available in the standard python package repositories and require installing manually from code repositories. Please get in touch if you need extra assistance.
Visualisation with TensorBoard¶
TensorBoard is a web-based visualisation tool that allows you to analyse the progress of your training. It comes installed with TensorFlow and includes the following features:
- tracking metrics such as loss and accuracy
- displaying image/audio data
- visualising the model graph
To visualise your data using TensorBoard, please see our Using TensorBoard via OnDemand page on our docs site for further information.
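As an illustration, hooking TensorBoard into a Keras training run is a small change; a minimal sketch, reusing the model and data from the MNIST example above (the logs directory name is just a convention):

import tensorflow as tf

# Write training metrics under ./logs for TensorBoard to read
tb_callback = tf.keras.callbacks.TensorBoard(log_dir='logs')
model.fit(x_train, y_train, epochs=10, callbacks=[tb_callback])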
References¶
[1] Princeton University GitHub (2020)