
tutorial

Using job statistics to increase job performance and reduce queueing time

You may wonder why some jobs start immediately while others wait in the queue for hours or days, even when the job is quite simple. If your job has been queueing for a while, consider adjusting the requested resources: this can reduce queueing time and cut resource wastage while the job runs. Below, we outline two useful tools for checking the resource usage of previous jobs.
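As a rough illustration of what checking resource usage involves, the sketch below compares what a finished job used with what it requested. The function name, argument names, and numbers are all hypothetical and are not taken from any Apocrita tool; substitute the figures your scheduler reports for a real job.

```python
def job_efficiency(cpu_seconds, wallclock_seconds, cores_requested,
                   max_mem_gb, mem_requested_gb):
    """Return (cpu_efficiency, mem_efficiency) as fractions.

    All arguments are hypothetical inputs that you would read from the
    scheduler's accounting output for a finished job.
    """
    if wallclock_seconds <= 0 or cores_requested <= 0 or mem_requested_gb <= 0:
        raise ValueError("wallclock, cores and requested memory must be positive")
    # CPU time actually used, divided by the CPU time that was reserved.
    cpu_eff = cpu_seconds / (wallclock_seconds * cores_requested)
    # Peak memory used, divided by the memory that was reserved.
    mem_eff = max_mem_gb / mem_requested_gb
    return cpu_eff, mem_eff


# Made-up example: a 4-core job that ran for one hour, used 3600 s of
# CPU time in total, and peaked at 2 GB of the 16 GB it requested.
cpu_eff, mem_eff = job_efficiency(3600, 3600, 4, 2, 16)
print(f"CPU efficiency: {cpu_eff:.0%}, memory efficiency: {mem_eff:.0%}")
```

Low values for either figure suggest the job could request fewer cores or less memory next time, which may help it start sooner.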

File Permissions

An understanding of file permissions is important to the success of computational jobs, and the security of your files.

The default settings suit some use cases but not all: without sufficient awareness, your files may be visible to people who should not be able to access them, or inaccessible to people who need them.
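To make the idea concrete, here is a minimal Python sketch that inspects a file's permission bits and reports who can read or write it. It runs against a temporary file, and the helper name `describe_access` is our own invention for illustration; it assumes a POSIX system and says nothing about the actual defaults on the cluster.

```python
import os
import stat
import tempfile


def describe_access(path):
    """Report whether group members and other users can read or write path."""
    mode = os.stat(path).st_mode
    return {
        "mode": stat.filemode(mode),
        "group_read": bool(mode & stat.S_IRGRP),
        "group_write": bool(mode & stat.S_IWGRP),
        "other_read": bool(mode & stat.S_IROTH),
        "other_write": bool(mode & stat.S_IWOTH),
    }


# Demonstrate on a throwaway file rather than on real data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o644)  # owner read/write, everyone else read-only
before = describe_access(demo_path)

os.chmod(demo_path, 0o600)  # owner read/write only
after = describe_access(demo_path)

print(before["mode"], "->", after["mode"])
os.remove(demo_path)
```

Bear in mind that access also depends on the permissions of every directory above the file, so a check like this covers only part of the picture.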

Running Machine Learning workloads on Apocrita

In this tutorial we show you how to run a TensorFlow job using the GPU nodes on the Apocrita HPC cluster. We expand upon the essentials provided on the QMUL HPC docs site and explain the process in more detail. We start with software installation, then demonstrate a simple task and a more complex real-world example that you can adapt for your own jobs, along with tips on how to check whether the GPU is being used.

A guide to using Apocrita's scratch storage

The Apocrita scratch storage is a high performance storage system designed for short-term file storage, such as working data. We recently replaced the hardware that provides this service, and expanded the capacity from 250TB to around 450TB. This article will look at the recent changes, and suggest some best practices when using the scratch system.

SSH authentication and regaining access to Apocrita

In response to a coordinated security attack on HPC sites world-wide, it has been necessary to implement some changes to enforce a higher level of authentication security. As a precautionary measure, SSH keys and passwords were revoked for all users. In this article, we begin by providing some useful background on key-based authentication, and then document the process for regaining access to the cluster.

Productivity tips for Apocrita cluster users

This article presents a selection of useful tips for running successful and well-performing jobs on the QMUL Apocrita cluster.

In the ITS Research team, we spend quite a bit of time monitoring the Apocrita cluster and checking jobs are running correctly, to ensure that this valuable resource is being used effectively. If we notice a problem with your job, and think we can help, we might send you an email with some recommendations on how your job can run more effectively. If you receive such an email, please don't be offended! We realise there are users with a range of experience, and the purpose of this post is to point out some ways to ensure you get your results as quickly and correctly as possible, and to ease the learning curve a little bit.

Sizing your Apocrita jobs for quicker results

At any one time, a typical HPC cluster is usually full. This is not such a bad thing, since it means the substantial investment is being put to good use rather than sitting idle. A less ideal situation is having to wait too long to get your research results. However, jobs are constantly starting and finishing, and many new jobs run shortly after being added to the queue. If your resource requirements are rather niche, or very large, then you will be competing with other researchers for a scarcer resource. Whatever sort of jobs you run, it is important to choose resources carefully in order to get your results sooner. Using fewer cores, although it increases the run time, may result in a much shorter queueing time.
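The trade-off in that last point can be sketched with a little arithmetic: what matters is the total time from submission to result, not the run time alone. The queue and run times below are invented purely for illustration and are not measurements from Apocrita.

```python
def time_to_result(queue_hours, run_hours):
    """Total hours from submission to finished job."""
    return queue_hours + run_hours


# Hypothetical options for the same piece of work.
options = {
    "16 cores": {"queue_hours": 20.0, "run_hours": 2.0},
    "4 cores": {"queue_hours": 1.0, "run_hours": 7.0},
}

totals = {name: time_to_result(**opt) for name, opt in options.items()}
best = min(totals, key=totals.get)

for name, hours in totals.items():
    print(f"{name}: {hours:.1f} hours from submission to result")
print(f"Quickest overall: {best}")
```

With these made-up figures, the smaller request finishes first despite the longer run time; real queue times vary with cluster load, so treat this only as a way of framing the decision.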