Using Apocrita's GPUs with OpenMP
On Apocrita we can use OpenMP to execute code on GPU devices. This post looks at how to compile such programs and submit them to run on the GPU nodes. The post assumes that you have code, already developed and tested, which is ready for deployment, and that you have been granted access to the GPU nodes.
If you do not yet have suitable code, or do not have access to the GPU nodes, please see the contact details at the end of this post.
OpenMP target offload
In this post we'll look at OpenMP programs which use the target
construct for offloading
execution to a non-host device. On Apocrita, the non-host devices are GPUs. To
use the GPU nodes on Apocrita you will have code that is ready to deploy, but
for this post we'll use the following C and Fortran examples.
test-gpu-openmp.c:
#include <math.h>
#include <stdio.h>
int main () {
  float x, y;
  int i, j;
  x=0;
  #pragma omp target map(tofrom:x)
  #pragma omp parallel for private(j,y) reduction(+:x)
  for (i=0; i<1000000; i++) {
    y=0.5;
    for (j=0; j<1000; j++) y=(y+sqrtf(y))/2;
    x=x+1-y;
  }
  printf("Our final result is: %f\n", x);
  return 0;
}
test-gpu-openmp.f90:
implicit none
integer :: i, j
real :: x, y
x=0
!$omp target map(tofrom:x)
!$omp parallel do private(j,y) reduction(+:x)
do i=1,1000000
  y=0.5
  do j=1,1000
    y=(y+SQRT(y))/2
  end do
  x=x+1-y
end do
!$omp end target
print '("Our final result is: ",G0.5)', x
end
We have on Apocrita a number of GPU node types. For the examples in this
post we'll first use the short interactive queues to access a node.
To access one of our usual nodes we can use the qlogin
command:
qlogin -pe smp 8 -l gpu=1
For more complete detail of how to request GPU resources for your jobs please see our documentation pages linked below.
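For example, if your code needs a particular device type you can request that too (here volta, the type used again later in this post):
qlogin -pe smp 8 -l gpu=1 -l gpu_type=volta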
To compile your code, and the examples, you will need to use a suitable compiler. Currently, the only suitable compiler on Apocrita is GCC (version 10.2.0 or later), for all node types. Other compilers, and older versions of GCC, may fail to understand the OpenMP offload syntax or may fail to generate code for offloading. For example, versions of GCC on Apocrita before 10.2.0 do not have offload support, and the Intel compilers capable of producing offload code do not target the GPU device types we have.
Compiling with GCC
Once we have a session on one of Apocrita's GPU nodes we can move to a directory with the source code example files above. To compile with GCC we first load a module which provides offload support, in this case version 10.2.0:
module load gcc/10.2.0
To check that this installation of GCC supports the desired offload we can
ask the gcc
command:
$ gcc -v
...
OFFLOAD_TARGET_NAMES=nvptx-none
...
This offload target nvptx-none
is the required target for our NVIDIA
devices: if gcc -v
doesn't report this target then code will not be offloaded
to our GPUs and this may mean that you haven't used an appropriate compiler
module.
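Another way to check, complementary to gcc -v, is to ask the OpenMP runtime itself. The short C sketch below is our own illustration (the file name check-devices.c is hypothetical): it calls omp_get_num_devices from omp.h to report how many offload devices the runtime can see.
#include <omp.h>
#include <stdio.h>

int main(void) {
  /* Number of non-host devices the OpenMP runtime could offload to. */
  int num_devices = omp_get_num_devices();
  printf("OpenMP offload devices visible: %d\n", num_devices);
  return 0;
}
Compiled with gcc -fopenmp check-devices.c -o check-devices and run on a GPU node, this should report at least one device when offload support is working; a count of zero means target regions would fall back to the host.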
To compile our C and Fortran examples with offloading we need to enable
OpenMP in the compiler with the -fopenmp
compile option. For our Fortran
example we can compile like:
gfortran -fopenmp test-gpu-openmp.f90 -o test-fortran
In our C example, we use the function sqrtf
so we need to link against the
maths library. We need to link against this library both with the host code
(-lm
) and the device code (-foffload=-lm
):
gcc -fopenmp test-gpu-openmp.c -o test-c -lm -foffload=-lm
The Fortran code of our example doesn't require a comparable -foffload=...
linker option, but some Fortran code may require the GFortran runtime library
in the device code. In this case -foffload=-lgfortran
should be specified but
-lgfortran
, to link the host code against the library, will be implied if
linking with gfortran
directly:
gfortran -fopenmp offload-rtl.f90 -foffload=-lgfortran -o test-fortran
With the compilation complete we should now have two compiled programs,
test-c
and test-fortran
, ready to execute.
Running the compiled programs
Once we have the compiled program and a session on a suitable GPU node we can
run the program and watch it execute on the GPU device. We can continue to use
the same qlogin
session that we had when compiling, but we can also create
a new session. Either way, we must be sure to have loaded the same module
that we used when compiling our code. For our examples that would be
gcc/10.2.0
. We advise further that, if you are compiling and running in separate sessions, you request the same GPU type (-l gpu_type=volta for example) in both sessions for maximum compatibility.
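As a sketch, a fresh session for running the examples compiled above (assuming the same Volta request and compiler module used elsewhere in this post) might be prepared like this:
qlogin -pe smp 8 -l gpu=1 -l gpu_type=volta
module load gcc/10.2.0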
On the GPU node, the compiled program can be run like any other, not requiring any special options to enable offloading:
$ ./test-c &
[1] 129036
$ nvidia-smi pmon
# gpu pid type sm mem enc dec command
# Idx # C/G % % % % name
0 - - - - - - -
0 129036 C 39 0 0 0 test-c
0 129036 C 92 0 0 0 test-c
0 129036 C 99 0 0 0 test-c
0 129036 C 99 0 0 0 test-c
In this case we ran the program test-c
in the background (using the &
after
the command name). The shell immediately returns the process ID of the running
program (here 129036
) and then we run nvidia-smi
to query the state of the
GPU to which the process is assigned.
After a little delay, while the process attaches to the device and starts up fully, we see the process (with the matching command name) reporting high utilisation under the sm column. Our example code does not require much memory (as reported by the mem column).
If, when observing the process over time, we do not see a process attach to the device, or usage remains low (particularly 0%), then the offload may have failed.
The running examples will quickly return a final result.
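If, as above, the program was started in the background, we can stop nvidia-smi pmon with Ctrl+C and bring the program back to the foreground to wait for it to finish:
fg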
Running the program in batch mode
Until now we've been using interactive jobs in the short queues. As with other programs we want to run on the cluster, once we're satisfied that the jobs are set up correctly we should submit them to the scheduler to run in batch as resources become available.
For this example, we've seen that our program has compiled correctly and has been running on the GPU devices as we'd expect. If we've compiled the program using GCC 10.2.0 targeting Volta nodes, for example, we can submit the following script:
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l gpu=1
#$ -l gpu_type=volta
#$ -l h_rt=240:0:0

module load gcc/10.2.0
./test-fortran
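If the script above is saved as, say, gpu-openmp.job (the name is only for illustration), it can be submitted from the directory containing test-fortran with qsub:
qsub gpu-openmp.job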
Using target constructs without the GPUs
We've seen how target
constructs can be used to offload execution to GPU
devices. However, we can also use source code which has these constructs
without using the GPU nodes, for testing or other reasons.
Using GCC compilers we can enable OpenMP using -fopenmp but disable offloading with -foffload=disable:
gcc -fopenmp -foffload=disable test-gpu-openmp.c -o test-c -lm
gfortran -fopenmp -foffload=disable test-gpu-openmp.f90 -o test-fortran
Equally, we could disable OpenMP compilation completely by using -fno-openmp or not giving -fopenmp.
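For example, compiling the C example with OpenMP disabled entirely still produces a working, purely serial program (the output name test-c-serial is just for illustration):
gcc test-gpu-openmp.c -o test-c-serial -lm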
Finally, code compiled with offload support may also be run when a GPU device is not available: when the target construct is reached, execution will continue on the host (CPU) if no offload device (GPU) is detected. For extreme testing cases, offloading may even be disabled at run time, when it was requested at compile time and a GPU is available, by using an environment variable, as sketched below. This should generally be avoided, however, as jobs requesting GPU nodes must usually make good use of the GPU devices.
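As a sketch of that run-time switch: the OpenMP standard defines the environment variable OMP_TARGET_OFFLOAD, which recent GCC releases honour, and setting it to DISABLED makes target regions execute on the host even when a GPU is present:
OMP_TARGET_OFFLOAD=DISABLED ./test-c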
Further documentation for Apocrita
For more detail on compiling on Apocrita, including using the GPUs, please see our documentation page on compiling and our introductory post about compilation. We also have more detailed documentation about how to submit jobs to run on the GPU nodes.
Contact us
The GPU nodes on Apocrita are restricted to approved users. Contact us, with details of your requirements and example code, to request access to these nodes.
Also contact us if you would like support in porting your existing OpenMP code to use GPUs or want an assessment of the possible value in using GPUs.