STFC Machine Learning Group Deploys Elastic NVMe Storage to Power GPU Servers

19 November 2019

Storage architecture using Excelero NVMesh®, BeeGFS file system on Flash-IO Talyn® systems from Boston Ltd. reduces machine learning training time from three-to-four days to under an hour, enables new analysis.

SAN JOSE, CA November 18, 2019 - Excelero, a disruptor in software-defined NVMe storage, announced that the Science and Technology Facilities Council (STFC) has deployed a new high-performance computing (HPC) architecture to support computationally intensive analysis including machine learning and AI-based workloads using the NVMesh™ elastic NVMe block storage solution. The deployment, done in partnership with Boston Limited, a provider of high performance, mission-critical server and storage solutions, is enabling researchers from STFC and the Alan Turing Institute to complete machine learning training tasks that formerly took three to four days in under an hour, and to run other foundational scientific computations that they formerly could not perform.


The Science and Technology Facilities Council is part of U.K. Research and Innovation (UKRI) and supports pioneering scientific and engineering research by over 1,700 academic researchers worldwide on space, materials and life sciences, nuclear physics and much more.

Facility shot of the Rutherford Labs

Research involves a wide variety of data-rich analysis on data generated by large-scale experimental facilities and observatories. These include cryo-electron microscopy, synchrotron light, and other techniques. Workloads are massive – often hundreds of terabytes (TB) – and require both fast compute and fast storage. For example, a typical workload involves a significant amount of computing multimodal data, such as those obtained from X-ray and neutron sources.

Four images from STFC's cloud masking project using machine learning to classify satellite imagery

STFC’s Scientific Machine Learning (SciML) Group was established to help scientists analyze large amounts of data by contributing machine learning and AI expertise. The group routinely uses deep neural networks running on state-of-the-art NVIDIA® DGX-2™ GPU computing systems located at the Scientific Data Centre at its Rutherford Appleton Laboratory site near Oxford. As the need for image processing expanded, the use of GPU-based workstations needed to be extended to support the high throughput and low latency required for end-user response times. Adding NVIDIA DGX-2 servers provided more compute capacity, but the servers alone lacked the enterprise-level storage functionality required to scale out the resource across the hundreds of researchers.

High-performance computing solutions provider Boston Ltd. worked with STFC to evaluate all-flash arrays and open systems-based storage options, and commissioned a benchmark of Excelero’s NVMesh for shared NVMe Flash storage at local performance.

Boston Ltd.’s benchmark results showed that, when running NVIDIA validation tests on each NVIDIA DGX-2 system, the proposed STFC architecture delivered an average latency of 70 microseconds, roughly one-quarter of the typical 250-microsecond latency of traditional controller-based enterprise storage. The combined NVMesh and BeeGFS deployment therefore showed potential for meeting STFC’s high throughput, low latency demands.
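As a quick check of the latency comparison, the two quoted figures can be put side by side. This is only an illustrative sketch: the constant and function names are hypothetical, and the only inputs are the 70-microsecond and 250-microsecond values stated above, not any additional benchmark data.

```python
# Figures quoted in the benchmark paragraph above (microseconds).
NVMESH_LATENCY_US = 70        # average latency of the proposed architecture
CONTROLLER_LATENCY_US = 250   # typical controller-based enterprise storage


def latency_ratio(new_us: float, baseline_us: float) -> float:
    """Return the new latency as a fraction of the baseline latency."""
    if baseline_us <= 0:
        raise ValueError("baseline latency must be positive")
    return new_us / baseline_us


ratio = latency_ratio(NVMESH_LATENCY_US, CONTROLLER_LATENCY_US)
reduction_factor = CONTROLLER_LATENCY_US / NVMESH_LATENCY_US

print(f"fraction of baseline latency: {ratio:.2f}")
print(f"latency reduction factor: {reduction_factor:.1f}x")
```

On these figures the benchmarked latency is 28% of the controller-based baseline, a reduction of about 3.6 times.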

STFC’s storage architecture now includes two Boston Flash-IO Talyn® systems built on Supermicro building blocks, networked via a Mellanox 100G InfiniBand network to two NVIDIA DGX-2 computing systems, each with 16 NVIDIA 32GB V100 SXM modules.

Operational since July 2019, STFC’s storage architecture has enabled training runs that formerly took three to four days to complete in under an hour. With the BeeGFS file system providing a single namespace to simplify management and virtualization, and the low latency and high throughput of its NVMesh system, STFC now has a GPU computing architecture where storage no longer presents a bottleneck, even with its complex research needs.
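To put the reported training-time reduction in perspective, a lower bound on the implied speedup can be worked out from the quoted durations. This is a rough sketch, assuming the new runs take at most one hour ("under an hour"); the function name is hypothetical and no measured timings beyond those quoted are used.

```python
HOURS_PER_DAY = 24


def min_speedup(old_days: float, new_hours_upper_bound: float) -> float:
    """Lower bound on the speedup if the new run finishes within the given hours."""
    if new_hours_upper_bound <= 0:
        raise ValueError("new run time bound must be positive")
    return old_days * HOURS_PER_DAY / new_hours_upper_bound


# Former run times of three and four days, new run time of at most one hour.
low = min_speedup(3, 1.0)
high = min_speedup(4, 1.0)

print(f"speedup of at least {low:.0f}x to {high:.0f}x")
```

Because the new run time is only bounded from above, these values are minimums: a three-day job finishing within an hour is at least 72 times faster, and a four-day job at least 96 times faster.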

With the new deployment in place, the user communities around the system, including users from the Alan Turing Institute, can now carry out machine learning research projects across a number of disciplines, including environment, life sciences, materials, space sciences and astronomy.

“In benchmark testing we quickly saw that our Flash-IO Talyn systems with the Excelero NVMesh software delivered a significant performance enhancement over traditional controller-based architectures found in all-flash arrays – and the ease of installation of a packaged solution,” said Matthew Parfitt, HPC commercial manager at Boston Ltd., who oversaw the deployment.

“Fundamental scientific research needs a clear computational path to completion, without the storage bottleneck that is endemic when NVMe resources are not virtualized,” said Lior Gal, CEO and co-founder of Excelero. “We’re proud that our NVMesh software helped our partner Boston Ltd. put an essential building block in place in the STFC architecture to support STFC’s vital initiatives.”

Excelero and Boston Ltd. are both showcasing their deployment at STFC, along with deployments at other major HPC research facilities around the world, at the SC19 Supercomputing event in Denver, CO, this week. Visit Excelero in booth #601, or online at www.excelero.com, and visit Boston Ltd. in booth #1849 or online at www.boston.co.uk.
