In this blog post, we will play about with neural networks, on a dataset called
ImageNet, to give some intuition about how these networks work. We will
train them on Apocrita with
DistributedDataParallel
and show benchmarks to give you a guide on how many GPUs to use. This is a
follow-on from a previous blog post where we explained how to
use DistributedDataParallel to speed up your neural network training with
multiple GPUs.
The delivery of new GPUs for research is continuing, most notably with the new
Isambard-AI cluster at
Bristol. As new cutting-edge GPUs are released, software engineers need to keep
abreast of the new architectures and features these GPUs offer.
The new Grace-Hopper GH200 nodes, as announced in a previous blog
post, consist of a 72-core NVIDIA Grace CPU and an
H100 Tensor Core GPU. One of the key innovations is NVIDIA NVLink
Chip-to-Chip (C2C) together with unified memory, which allows data to be
transferred between CPU and GPU quickly, seamlessly and automatically. It also
lets the GPU memory be oversubscribed, so the GPU can work on data much larger
than it can hold at once, potentially tackling out-of-GPU-memory problems. This
frees software engineers to focus on implementing algorithms without having to
think too much about memory management.
This blog post will demonstrate manual GPU memory management and introduce
managed and unified memory with simple examples to illustrate their benefits.
We'll try to keep this at an introductory level, but the post does assume basic
knowledge of C++, CUDA and compiling with nvcc.
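The post's examples are written in C++ and compiled with nvcc, but as a quick
taster of the same idea from Python, here is a minimal sketch using CuPy (our
choice purely for illustration; the post itself does not use CuPy), which can
route its allocations through CUDA managed memory:

```python
import cupy as cp

# Back every CuPy allocation with CUDA managed (unified) memory, the same
# mechanism CUDA C++ code reaches via cudaMallocManaged.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

# This array lives in managed memory: the driver migrates pages between
# CPU and GPU on demand, and on hardware that supports oversubscription
# it may even exceed the GPU's physical memory.
x = cp.arange(100_000_000, dtype=cp.float32)
print(float(x.sum()))  # the reduction runs on the GPU
```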
For much of the year we have been working on a major project to upgrade
Apocrita to a new operating system. As part of the project, we have deployed a
new package building tool to help us recompile all of the research applications
to work on the new system. We are now calling for Apocrita users to preview and
test this new system, ahead of our full roll-out, to help bring about a smoother
and quicker transition.
This is an opportunity to check that your applications work on the new system,
and for us to address any issues before we fully roll it out.
Regular expressions, or regex, are patterns used to match strings of text.
They can be very useful for searching, validating, or manipulating text
efficiently. This guide will introduce the basics of regex with easy-to-follow
examples.
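To give a flavour of those examples, here is a minimal sketch in Python (our
choice of language here; the patterns themselves carry over to any regex
engine):

```python
import re

# \d{4}-\d{2}-\d{2} matches an ISO-style date: four digits, a dash,
# two digits, another dash, then two more digits.
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

text = "The workshop took place on 2024-05-03 in Whitechapel."
match = pattern.search(text)
if match:
    print(match.group())  # prints: 2024-05-03
```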
In this blog post, we explore what
torchrun and
DistributedDataParallel
are and how they can be used to speed up your neural network training by using
multiple GPUs.
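As a taster, here is a minimal sketch of the usual PyTorch pattern (the tiny
model and the launch command are illustrative placeholders rather than the
post's actual example):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with, e.g.: torchrun --nproc_per_node=4 train.py
# torchrun starts one process per GPU and sets RANK, LOCAL_RANK
# and WORLD_SIZE in each process's environment.
def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).to(local_rank)  # stand-in for a real network
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... training loop: gradients are averaged across GPUs automatically ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```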
If you go for a run every morning, or drive to work on weekdays, you will know
that every journey is unique.
For me, every High Performance Computing (HPC) workshop I deliver has its own
personality: the audience, the material tailored to that audience, the
interactions and questions, and of course, the energy of the community.
On Thursday, September 26, an HPC workshop for the Wolfson Institute of
Population Health was held from 2:00 p.m. to 5:00 p.m. The seminar included, as
usual, presentations, a coffee break, a quiz and treats, and photographs to
make it memorable.
On August 9, the High Performance Computing workshop for the School of
Engineering and Materials Science was held in the Sofa Room at Dept. W.
Around 16 researchers who already use Apocrita attended the event.
The event covered six topics: Linux commands for Apocrita, HPC clusters at QMUL,
Launching HPC jobs, Applications for SEMS, Using GPUs, and Miscellaneous.
We still encounter jobs on the HPC cluster
that try to use all the cores on the node on which they're running, regardless
of how many cores they requested, leading to node alarms. Sometimes, jobs try
to use exactly twice or one-and-a-half times the allocated cores, or even that
number squared. This was a little perplexing at first; the usual culprit is
nested parallelism, for example a pool of worker processes where each worker
also starts a full complement of threads. In your enthusiasm to parallelize
your code, make sure someone else hasn't already done so.
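One common fix, sketched below in Python, is to cap library thread pools to the
job's actual allocation. This is only a sketch: it assumes a Grid Engine-style
scheduler that sets NSLOTS, as on Apocrita, and libraries that honour the usual
threading environment variables.

```python
import os

# How many cores did the scheduler actually give us? NSLOTS is set by
# (Univa) Grid Engine; substitute your scheduler's equivalent variable.
n_cores = os.environ.get("NSLOTS", "1")

# Cap the thread pools that common numerical libraries spawn implicitly.
# This must happen *before* those libraries are imported.
os.environ["OMP_NUM_THREADS"] = n_cores
os.environ["OPENBLAS_NUM_THREADS"] = n_cores
os.environ["MKL_NUM_THREADS"] = n_cores

import numpy as np  # NumPy's BLAS backend now stays within the allocation

a = np.random.rand(2000, 2000)
b = a @ a  # this matrix multiply uses at most the allocated cores
```

If you also fork worker processes yourself, set these variables to 1 instead;
otherwise each worker brings its own full set of threads, which is exactly how
jobs end up at the squared core counts described above.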
On May 3, 2024, Queen Mary University of London held a workshop at the
Department W building in Whitechapel to introduce our students to Linux.
Students from a variety of programmes at Queen Mary attended the workshop.
Many of the students who participated are working towards Master's and PhD
degrees.