rse

June 15, 2026
in rse
21 min read

Modernising Rcpp and C++ Code With `std::span`

The programming language C++ is still widely used today, especially for high-performance computing. Out-of-date practices from the 80s and 90s should still work now because the language is designed to be highly backwards compatible. As C++ evolved, there were numerous efforts to modernise the language and update programming practices with new features to improve the C++ experience, such as memory-safety features.

October 1, 2025
in rse
25 min read

Poisson-Icing 🐟❄️ - Gibbs Sampling with a GPU using CuPy

Hybrid programming allows you to program the majority of your software in your favourite language but performance-critical parts in a faster language. With the Python package CuPy, you can program CPU code in Python and custom GPU kernel functions in CUDA. Thus, we can design our software with a familiar Python interface but run faster GPU code under the hood.

CuPy also has GPU versions of existing NumPy functions which may help transition your CPU code to the GPU without modifying your code too much. This may also help structure your GPU code with familiar NumPy functions, making it readable to many Python users.

June 2, 2025
in rse
19 min read

Python GPU Programming with Numba and CuPy

In a previous blog, we looked at using Numba to speed up Python code by using a just-in-time (JIT) compiler and multiple cores. The speed-up is remarkable with small changes to the existing code.

In this blog post, we will continue exploring the Numba ecosystem and implement the Gauss map on the GPU, gaining a further speed-up while still writing Python code. We will also look at CuPy, which is another way to write and run GPU code. Instead of being Pythonic, it allows hybrid programming where the GPU code is written in CUDA but executed from Python.

One key advantage of hybrid programming is writing software in your preferred language while optimising performance-critical sections in a faster language - the best of both worlds!

March 21, 2025
in rse
3 min read

R on Rocky 9

With the major operating system upgrade from CentOS 7 to Rocky 9, we want to ensure that using R, RStudio, and Open OnDemand (OOD) is as seamless as possible. This post will include new tips for a better experience, as well as a reiteration of the important or frequently forgotten old tips.

January 17, 2025
in rse
8 min read

A PyTorch DDP Case Study With ImageNet

In this blog post, we will play about with neural networks, on a dataset called ImageNet, to give some intuition on how these neural networks work. We will train them on Apocrita with DistributedDataParallel and show benchmarks to give you a guide on how many GPUs to use. This is a follow-on from a previous blog post where we explained how to use DistributedDataParallel to speed up your neural network training with multiple GPUs.

December 3, 2024
in rse
10 min read

Unification of Memory on the Grace Hopper Nodes

The delivery of new GPUs for research is continuing, most notably with the new Isambard-AI cluster at Bristol. As new cutting-edge GPUs are released, software engineers need to keep up with the new architectures and the features these GPUs offer.

The new Grace Hopper GH200 nodes, as announced in a previous blog post, consist of a 72-core NVIDIA Grace CPU and an H100 Tensor Core GPU. One of the key innovations is the NVIDIA NVLink Chip-2-Chip (C2C) interconnect and unified memory, which together allow data to be transferred between the CPU and GPU quickly and automatically. Unified memory also allows GPU memory to be oversubscribed, so the GPU can handle data much larger than it can host, potentially tackling out-of-GPU memory problems. This allows software engineers to focus on implementing algorithms without having to think too much about memory management.

This blog post will demonstrate manual GPU memory management and introduce managed and unified memory with simple examples to illustrate their benefits. We'll try to keep this to an introductory level, but the blog does assume basic knowledge of C++, CUDA and compiling with nvcc.

October 4, 2024
in rse
12 min read

A Short Guide to PyTorch DDP

In this blog post, we explore what torchrun and DistributedDataParallel are and how they can be used to speed up your neural network training by using multiple GPUs.

July 12, 2024
in rse
7 min read

A Slight Case of Overthreading

We still encounter jobs on the HPC cluster that try to use all the cores on the node on which they're running, regardless of how many cores they requested, leading to node alarms. Sometimes, jobs try to use exactly twice or one-and-a-half times the allocated cores, or even that number squared. This was a little perplexing at first. In your enthusiasm to parallelize your code, make sure someone else hasn't already done so.
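
As a rough illustration of how the "number squared" case can arise, here is a minimal Python sketch. It assumes the scheduler exports the granted core count in an `NSLOTS` environment variable (as Grid Engine style schedulers do, which may not match your cluster), and the helper names `allocated_cores`, `total_threads` and `thread_limits` are hypothetical. It is a sketch of the idea, not the method used in the post.

```python
import os


def allocated_cores(env=None, default=1):
    """Return the number of cores granted to this job.

    Assumption: the scheduler exports the count as NSLOTS. Change the
    variable name to whatever your cluster provides.
    """
    env = os.environ if env is None else env
    try:
        cores = int(env.get("NSLOTS", default))
    except ValueError:
        # A malformed value falls back to the default
        cores = default
    return max(cores, 1)


def total_threads(n_processes, threads_per_process):
    """Threads running at once when every process starts its own pool."""
    return n_processes * threads_per_process


def thread_limits(cores, n_processes=1):
    """Environment variables that cap the threads used by each process.

    The allocated cores are shared between the processes, so each one
    gets cores // n_processes threads (at least one).
    """
    per_process = max(cores // n_processes, 1)
    names = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")
    return {name: str(per_process) for name in names}


if __name__ == "__main__":
    cores = allocated_cores()
    # Worst case: one worker process per core, each with one thread per core
    print("uncapped threads:", total_threads(cores, cores))
    # Capped case: one worker process per core, each limited to its share
    print("limits per worker:", thread_limits(cores, n_processes=cores))
```

These limits only take effect if they are set in the environment before the numerical library is loaded, for example in the job script or at the very top of the program.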

February 22, 2024
in rse
21 min read

Some Pleasingly Parallel GPU Case Studies in Machine Learning

In a previous blog, we discussed ways we could use multiprocessing and mpi4py together to use multiple nodes of GPUs. We will cover some machine learning principles and two examples of pleasingly parallel machine learning problems. These are also known as embarrassingly parallel problems, but I prefer to call them pleasingly parallel because there isn't anything embarrassing about designing your problem to be run in parallel. When doing so, you could launch very similar functions on each GPU and collate their results when needed.

January 26, 2024
in rse
12 min read

Some Strategies for Using Multiple Nodes of GPUs

Using multiple GPUs is one option to speed up your code. On Apocrita, we have V100, A100 and H100 GPUs available, with up to 4 GPUs per node. On other compute clusters, JADE2 has 8 V100 GPUs per node and Sulis has 3 A100 GPUs per node. If your problem is pleasingly parallel, you can distribute identical or similar tasks to each GPU on a node, or even on multiple nodes.
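
One simple way to picture distributing similar tasks across GPUs is a round-robin assignment. The sketch below is plain Python with no GPU calls. The function name `assign_round_robin` is hypothetical and the task labels are placeholders, so treat it as an illustration of the idea, not as the strategy described in the post.

```python
def assign_round_robin(tasks, n_gpus):
    """Map each GPU index to the list of tasks it should run.

    Tasks are dealt out in turn, so the GPUs receive task counts that
    differ by at most one.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    assignment = {gpu: [] for gpu in range(n_gpus)}
    for index, task in enumerate(tasks):
        assignment[index % n_gpus].append(task)
    return assignment


if __name__ == "__main__":
    # Ten placeholder tasks shared between the 4 GPUs of a single node
    for gpu, gpu_tasks in assign_round_robin(range(10), 4).items():
        print(f"GPU {gpu}: {gpu_tasks}")
```

Each worker process would then run only the tasks listed for its own GPU and the results would be collated at the end.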