Cluster update summary
As part of our commitment to providing stable and manageable systems, here is a round-up of some recent updates we have been working on behind the scenes:
1) Upgrade of all HPC cluster nodes to CentOS 7.6
Over the last couple of weeks, you may have noticed a few nodes in a disabled
or maintenance state when running the nodestatus command. We have been rolling
out an operating system update from CentOS 7.4 to 7.6, which provides essential
security updates and some other fixes. We run the operating system update on
each node as an exclusive cluster job, followed by benchmarks and functionality
tests, before bringing the node back online. This allows us to perform essential
updates with minimal disruption.
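The per-node sequence described above (update, then test, then return to service) can be pictured with a minimal sketch. This is an illustration only, not our actual tooling: the function `rolling_update`, the `update` and `validate` callables, and the node names are all hypothetical stand-ins for the real update job and the benchmark and functionality tests.

```python
from typing import Callable, Iterable, List, Tuple


def rolling_update(
    nodes: Iterable[str],
    update: Callable[[str], bool],
    validate: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Update nodes one at a time; return (back_online, held_back).

    `update` stands in for the exclusive update job and `validate` for the
    benchmarks and functionality tests. Both are hypothetical callables that
    return True on success.
    """
    back_online: List[str] = []
    held_back: List[str] = []
    for node in nodes:
        # A node returns to service only if the update and the checks both pass.
        # `validate` is skipped when `update` fails (short-circuit `and`).
        if update(node) and validate(node):
            back_online.append(node)
        else:
            held_back.append(node)
    return back_online, held_back


if __name__ == "__main__":
    # Hypothetical node names; "node2" fails its post-update checks.
    ok, held = rolling_update(
        ["node1", "node2"],
        update=lambda n: True,
        validate=lambda n: n != "node2",
    )
    print(ok, held)  # ['node1'] ['node2']
```

Because each node is handled independently, a node that fails its checks stays out of service without holding up the rest of the cluster.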
2) GPU additions and CUDA driver updates
We recently purchased an additional 4 NVIDIA V100 GPUs to keep up with demand for GPU acceleration. These were added to the sbg nodes, bringing each server to 4 GPUs. The CentOS 7.6 update also allowed us to update the CUDA drivers to version 10.0, which provides performance improvements and allows use of the latest TensorFlow and MATLAB versions.
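If you would like to confirm what a GPU job can see, the following is a minimal sketch that lists each visible GPU and the installed NVIDIA driver version. It assumes `nvidia-smi` is on the `PATH` of the node running your job; the helper names `parse_gpu_query` and `query_gpus` are our own for illustration. Note that `driver_version` is the NVIDIA driver release, not the CUDA toolkit version, and a job may only see the GPUs it was allocated.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_query(text: str) -> List[Tuple[str, str]]:
    """Parse 'name, driver_version' CSV lines into (name, driver) pairs."""
    gpus: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last comma so a comma in the GPU name cannot break parsing.
        name, driver = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, driver))
    return gpus


def query_gpus() -> List[Tuple[str, str]]:
    """Run nvidia-smi (assumed to be on PATH) and return the visible GPUs."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_query(result.stdout)
```

Calling `query_gpus()` from within a GPU job returns one `(name, driver_version)` pair per GPU visible to that job.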
3) sdv node firmware updates
Our sdv nodes were quite new on the market when they were purchased and, as a result, have since received a number of firmware updates that address bugs, security issues, and performance and hardware compatibility problems. We will be applying these over the next couple of weeks as nodes become available.