Memory usage can be broadly simplified into two values: virtual memory
(VMEM), which is the amount of memory a program believes it has, and physical
memory (also known as "Resident Set Size", or RSS for short), which is the
amount of RAM the program actually occupies.
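As a minimal sketch of the distinction, on Linux a process can read its own virtual and resident sizes from `/proc/self/status` (the field names `VmSize` and `VmRSS` are Linux-specific, and values are reported in kB):

```python
# Sketch: report the current process's virtual vs resident memory
# by parsing /proc/self/status (Linux-only; values are in kB).

def memory_usage_kb():
    """Return (virtual_kb, resident_kb) for the current process."""
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            # Lines look like "VmSize:    12345 kB"
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":")
                fields[key] = int(value.strip().split()[0])
    return fields["VmSize"], fields["VmRSS"]

vmem, rss = memory_usage_kb()
print(f"Virtual: {vmem} kB, Resident: {rss} kB")
```

The virtual figure is typically much larger than the resident one, since it counts every mapping the process holds (shared libraries, memory-mapped files, allocated-but-untouched pages), not just the pages currently backed by RAM.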
NVIDIA recently announced the GH200 Grace Hopper Superchip, a combined
CPU and GPU with high memory bandwidth, designed for AI workloads. These
chips will also feature in the forthcoming Isambard
AI national supercomputer. We were offered the chance to pick up a couple of
these new servers for a very attractive launch price.
The CPU is a 72-core Arm-based Grace processor, connected to an H100
GPU via NVIDIA's NVLink-C2C chip-to-chip interconnect, which delivers 7x the
bandwidth of the PCIe Gen5 links commonly found in our other GPU nodes. This
effectively allows the GPU to access the system memory seamlessly. This
datasheet
contains further details.
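The 7x figure is easy to sanity-check against NVIDIA's published numbers (taken here as assumptions; see the datasheet): NVLink-C2C offers 900 GB/s of total bandwidth, while a PCIe Gen5 x16 link provides roughly 128 GB/s bidirectional (about 64 GB/s each way):

```python
# Back-of-the-envelope check of the "7x PCIe Gen5" claim,
# using figures assumed from NVIDIA's published specifications.
nvlink_c2c_gbs = 900        # NVLink-C2C total bandwidth, GB/s
pcie_gen5_x16_gbs = 128     # PCIe Gen5 x16 bidirectional, GB/s

speedup = nvlink_c2c_gbs / pcie_gen5_x16_gbs
print(f"NVLink-C2C vs PCIe Gen5 x16: {speedup:.1f}x")  # ~7.0x
```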
Since this new chip offers a lot of potential for accelerating AI workloads,
particularly for workloads requiring large amounts of GPU RAM or involving a
lot of memory copying between the host and the GPU, we've been running a few
tests to see how this compares with the alternatives.
Compression tools can significantly reduce the amount of disk space consumed by
your data. In this article, we will look at the effectiveness of some
compression tools on real-world data sets, make some recommendations, and
perhaps persuade you that compression is worth the effort.
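As a quick illustration of the kind of comparison involved, Python's standard library ships bindings for three common codecs (gzip, bzip2 and xz), so compression ratios can be measured directly. The sample data below is deliberately repetitive, so treat the resulting ratios as illustrative only; real-world results depend heavily on the data:

```python
# Sketch: compare compression ratios of stdlib codecs on a sample
# of highly compressible text. Ratios on real data will differ.
import bz2
import gzip
import lzma

data = b"An example record with repetitive, compressible content.\n" * 2000

codecs = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "xz": lzma.compress,
}

for name, compress in codecs.items():
    compressed = compress(data)
    ratio = len(data) / len(compressed)
    print(f"{name:>5}: {len(compressed):6d} bytes ({ratio:.0f}x smaller)")
```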
We have recently procured 120TB of NVMe-based SSD storage from E8 Storage for
the Apocrita HPC Cluster. The plan is to deploy this to replace our oldest
and slowest provision of scratch storage. We have been performing extensive
testing on this new storage, as we expect it to offer new possibilities and
advantages within the cluster.
Singularity (now Apptainer) is a container
solution designed for HPC. Due to its secure and simple design, it can
easily be used to provide applications on HPC clusters where other
container solutions, such as Docker, would not be suitable.