Python performance in containers¶
Singularity (now Apptainer) is a container solution designed for HPC. Due to the secure and simple design, it can be easily used to provide applications for use with HPC clusters where other containers, such as Docker would not be suitable.
Linux containers offer a great way to compare software performance on a production system without installing extra packages on your shared filesystem.
After early tests (on 3 different node types using python 2.7.8 on CentOS6 containers) showed SCL python running code 16-25% more slowly than a self-compiled python 2.7.8 running natively, further tests were performed, instead with CentOS 7.
A single-core python job to find primes of a large number was run natively on a compute node via the Univa Grid Engine job scheduler. The job was then repeated on the same node inside a container running the same version of python compiled with gcc, and also a packaged python provided by the CentOS Software Collections Library. The SCL is marketed as a simple way to get multiple versions of python on your enterprise OS without having to compile new versions. Jobs were run using CentOS 7.3 and python 2.7.13 supplied in the following ways:
- (A) Native OS
- (B) Singularity container, python compiled from source
- (C) Singularity container, python compiled from source, configured with
--enable-optimizations
flag - (D) Singularity container, python supplied via CentOS SCL
Where python was compiled, the same gcc version 4.8.5 was used. Tasks A and B can be compared directly to establish if there is any performance impact on using containers vs native OS.
Method¶
Each task was run multiple times on an Lenovo Nextscale nx360m5compute node with 2x Xeon E5-2683v4 processors. The first experiment involved running one job at a time on an exclusive node (in total 10 jobs for each of the 4 different task types). The second experiment ran batches of 10 jobs at a time on an exclusive node, grouped by task type.
Wallclock values were taken from the Grid Engine qacct
command. The total
wallclock time is reported by the job scheduler, and includes any overheads a
container may introduce, such as time to load a container file. The measured
time does not include the time taken to perform prerequisite tasks such as
creating the re-usable container or compile python. Jobs were run on a node
with no other jobs running, but part of a production system that is
susceptible to external influences such as GPFS filesystem load and system
temperatures.
Results¶
The plot shows the time taken to complete each job. Results from both experiments are shown.
Mean values over 10 runs on Xeon E5-2683v4 (jobs running separately)¶
Task | Walltime/s | I/O ops/s | Memory/MB |
---|---|---|---|
Native | 4682 | 3759 | 560 |
Container | 4608 | 3971 | 554 |
Container, enable-optimisations |
9544 | 3948 | 1142 |
Container, SCL | 5564 | 4009 | 676 |
Mean values over 10 runs on Xeon E5-2683v4 (like-tasks run simultaneously)¶
Task | Walltime/s | I/O op/s | Memory/MB |
---|---|---|---|
Native | 5060 | 3779 | 605 |
Container | 5066 | 3966 | 605 |
Container, enable-optimisations |
10440 | 3967 | 1244 |
Container, SCL | 6022 | 4033 | 726 |
Summary¶
It can be observed from the results that for the single core, CPU-bound job, the python compiled inside a container produces similar results to the compiled python on the native OS. Therefore, the container does not affect performance, in terms of CPU resource or RAM. In fact, the python container was often faster for the separate runs.
The simultaneous run tasks showed less jitter in the results, as expected, since all jobs of each type were run at the same time. The average runtime was increased slightly for all tasks, including the native OS. Some thermal throttling may be occurring on the CPU as 10 processors would be maxed out instead of 1 for the job duration.
The SCL python runs consistently slower than the compiled python on all nodes types, suggesting that, although a convenient method for system administrators to provide alternate versions, it may not be suitable for HPC environments. A container running python would perform better.
The python compiled with --enable-optimizations
performed very poorly, and
should be a lesson to system administrators not to blindly follow suggestions
without testing. Quite why it performed so badly requires further
investigation.
Containers provide an excellent way to provision tricky applications, particularly in the Bioinformatics and Deep-learning disciplines, but additionally provide a safe and easy way to gain performance improvements and new features offered by the latest and greatest versions of common applications, while offering an easy way to compare different configuration and compilation options that, as we have seen, could have significant impact on performance, which is critical in an HPC environment.
Data files¶
The Singularity definition files used for these tests can be found on Github.
References¶
- Python project
- CentOS
- Singularity
- Work was performed on QMUL's Apocrita Cluster
Title image: Teng Yuhong on Unsplash