Posted on 10 July, 2023
In our latest instalment of the Boston Technical Blog, we take a look at NVIDIA's much-anticipated Grace Hopper Superchip. Make sure you have your cup of coffee ready and we'll get stuck in.
Let's break down the product name. Grace Hopper was a pioneer in computer programming: she invented the first ever compiler and co-developed COBOL (an early high-level programming language), which is still used today, over 60 years later. "Superchip" is fairly self-explanatory - two impressive processors joined in a single package. You've got to be highly confident to use a name like that, especially when it features your company's first ever datacentre CPU.
It's clear NVIDIA is not messing around with this product's name. They've earmarked it as a cornerstone product, perhaps not just for themselves but for the HPC and AI industry as a whole.
In a nutshell, Grace Hopper is NVIDIA's heterogeneous (CPU, GPU and memory) building block for large-scale HPC and AI deployments. The key difference is that Grace Hopper makes up to 150TB of peer memory (across 256 Superchips) directly accessible, whereas current deployments reach remote memory over traditional Ethernet and InfiniBand networking - a potentially huge bottleneck.
You may have already heard of NVIDIA's NVLink, now in its 4th generation. This high-performance interconnect was originally brought to market to address PCIe bandwidth limitations when communicating between NVIDIA's class-leading datacentre GPUs. Each generation got faster, and the current generation achieves a blazing 900GB/s per GPU - some 7 times faster than a PCIe Gen 5 x16 link, whilst also being 5x more energy efficient.
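That 7x figure is easy to sanity-check with some back-of-envelope arithmetic. A sketch, assuming a PCIe Gen 5 x16 link offers roughly 128 GB/s of total bidirectional bandwidth (~64 GB/s per direction):

```python
# Back-of-envelope comparison of NVLink-C2C and PCIe Gen 5 bandwidth.
# Assumption: a PCIe Gen 5 x16 link provides ~128 GB/s total bidirectional.
nvlink_c2c_gbps = 900        # GB/s per Grace Hopper superchip (bidirectional)
pcie_gen5_x16_gbps = 128     # GB/s, approximate bidirectional figure

speedup = nvlink_c2c_gbps / pcie_gen5_x16_gbps
print(f"NVLink-C2C is roughly {speedup:.0f}x a PCIe Gen 5 x16 link")  # roughly 7x
```

The ratio lands at about 7, in line with NVIDIA's quoted comparison.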
This formerly GPU-exclusive interconnect technology has now been opened up for chip-to-chip (C2C) communication - not just in the Grace Hopper Superchip, but also for semi-custom silicon-level integration as future designs become increasingly accelerated and chiplet-based.
NVLink-C2C is at the heart of Grace Hopper's design: it gives the GPU direct access to 512GB of LPDDR5X CPU memory, which is both high-bandwidth and power-efficient. This memory-coherent design between CPU and GPU is what enables a truly heterogeneous platform, and it is the first of its kind.
Grace Hopper Superchip Architecture
NVIDIA's Grace CPU is the company's first data centre CPU, featuring 72 Arm Neoverse V2 cores - Arm's maximum-performance core design. As previously mentioned, the Grace CPU has 512GB of coherent LPDDR5X memory, which is both energy efficient and fast, delivering 546 GB/s of memory bandwidth per CPU. Compared against a traditional 8-channel DDR5 design, Grace's memory provides up to 53% more bandwidth at a fraction of the power.
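The "up to 53%" claim can be roughly reproduced. A sketch, assuming the comparison point is an 8-channel platform populated with DDR5-5600 (5600 MT/s on an 8-byte bus per channel):

```python
# Rough check of Grace's memory-bandwidth claim against a conventional
# 8-channel DDR5 platform. Assumption: DDR5-5600 DIMMs (5600 MT/s, 8-byte bus).
grace_bw_gbps = 546                        # GB/s per Grace CPU (LPDDR5X)
ddr5_channel_gbps = 5600e6 * 8 / 1e9       # 44.8 GB/s per channel
ddr5_8ch_gbps = 8 * ddr5_channel_gbps      # 358.4 GB/s total

extra_pct = (grace_bw_gbps / ddr5_8ch_gbps - 1) * 100
print(f"Grace provides ~{extra_pct:.0f}% more bandwidth")  # ~52%
```

With those assumptions the advantage comes out at roughly 52%, close to the quoted "up to 53%".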
The GPU being paired here is Hopper, the 9th and latest generation of NVIDIA's data centre GPUs, already in huge demand as the AI boom has taken hold of the market. Hopper was also first to market with HBM3 memory, 96GB of which can be found on NVIDIA's Grace Hopper Superchip, allowing for a screaming 3TB/s of memory bandwidth. Hopper also benefits from more Streaming Multiprocessors, higher frequencies and new 4th Gen Tensor Cores; all of this can be leveraged by the new Transformer Engine for a potential 6x throughput over A100, NVIDIA's previous flagship data centre GPU.
Diagram of this architecture and key features below:
Hardware-based memory coherency means developers have a much simpler time when provisioning memory: CPU and GPU threads can transparently and concurrently access both CPU and GPU local memory with ease. When scaling to neighbouring Grace Hopper Superchips, NVIDIA's 4th Gen NVLink Network is used to access peer memory, enabling the platform to solve the world's largest computing problems faster than ever.
NVLink Switch System
NVIDIA's 4th generation NVLink Switch System can directly connect up to 8 Grace Hopper superchips per NVSwitch. Additionally, a 2nd level in a fat-tree topology allows up to 256 Grace Hopper superchips to be networked together. When all 256 superchips are connected, this network can deliver a staggering 115.2 TB/s of all-to-all bandwidth - 9x greater than the all-to-all bandwidth of NVIDIA InfiniBand NDR400.
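The 115.2 TB/s figure follows directly from per-superchip injection bandwidth. A sketch, assuming each superchip injects 450 GB/s into the NVLink Switch fabric (the same rate quoted for EGM access below):

```python
# Sanity-check of the quoted all-to-all aggregate bandwidth.
# Assumption: each superchip injects 450 GB/s into the NVLink Switch fabric.
superchips = 256
per_chip_bw_gbps = 450                           # GB/s per superchip

total_tbps = superchips * per_chip_bw_gbps / 1000  # convert GB/s to TB/s
print(f"{total_tbps} TB/s aggregate all-to-all bandwidth")  # 115.2 TB/s
```

256 x 450 GB/s = 115,200 GB/s, i.e. exactly the 115.2 TB/s quoted.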
NVIDIA's 4th generation NVLink allows GPU threads to address up to 150TB of memory from all Grace Hopper superchips in the network for normal memory operations, bulk transfers and atomic operations. This also allows communication libraries such as MPI, NVSHMEM and NCCL to transparently leverage the NVLink Switch System.
NVIDIA calls this feature Extended GPU Memory (EGM). It is specifically designed for applications with massive memory footprints - bigger than the local (HBM3 + LPDDR5X) capacity of a single Grace Hopper superchip. EGM allows GPU threads to access all memory resources, both HBM3 and LPDDR5X, at 450 GB/s over the NVSwitch fabric.
What are the targeted workloads for acceleration?
We've already discussed Grace Hopper's main design features - its C2C interconnect and how the superchips extend via NVLink - but not yet its 1:1 GPU-to-CPU ratio. This heterogeneous design is uniquely positioned for HPC (fluid dynamics, weather/climate modelling and molecular dynamics), machine learning (recommender systems, natural language processing and Graph Neural Networks) and databases.
Below you can find speedups for end-user applications in the above-mentioned fields, comparing Grace Hopper against a traditional x86 + Hopper system.
All of these applications can make use of concurrent CPU and GPU processing; we'll take a closer look at a selection of them below.
Database workloads have remarkably large input tables that will not fit in GPU memory; performance is therefore often constrained by CPU-GPU data transfer over the PCIe link. Grace Hopper's NVLink-C2C alleviates this bandwidth limitation, as both CPU and GPU concurrently build a shared hash table for join and group-by operations, leveraging access to both HBM3 and LPDDR5X memory, which are hardware coherent.
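To make the hash-join pattern concrete, here is a minimal, single-threaded Python sketch. On Grace Hopper, CPU and GPU threads would build and probe the table concurrently over coherent memory; the table names and rows below are invented purely for illustration:

```python
# Minimal sketch of the hash-join algorithm described above.
# Build phase hashes the (usually smaller) build-side table;
# probe phase streams the other table and emits matching rows.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two lists of dicts on the given keys."""
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)  # build phase
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):       # probe phase
            out.append({**match, **row})
    return out

custs = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Grace"}]
orders = [{"cust_id": 1, "total": 30}, {"cust_id": 2, "total": 75}]
print(hash_join(custs, orders, "cust_id", "cust_id"))
```

The interesting part on Grace Hopper is that the shared `table` can live in either HBM3 or LPDDR5X and still be updated coherently from both processors.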
Below we can see performance comparisons between HGX Grace Hopper and a traditional x86 + Hopper system: performance simulations for hash join with input tables in CPU memory (left), and host-to-device transfer of pageable host-resident memory (right).
The speedups are not just down to raw bandwidth: on platforms without Address Translation Services (ATS), transfers must go through a host-pinned buffer for correctness. Grace Hopper's design therefore also simplifies the workflow, without further application-side changes.
Natural Language Processing
AI and Large Language Models (LLMs) have dominated the headlines in recent months, having grown in size and complexity at a rapid pace over the last few years. OpenAI's ChatGPT uses their most recent GPT-4 model, which has reportedly passed the milestone of 1 trillion parameters, and there are many competing models with parameters in the hundreds of billions. All of these massive models are trained on very large datasets using huge GPU clusters over the course of months - an incredibly costly process, as you may imagine.
To get better-quality answers from an LLM, prompt engineering techniques are used - but this would take a huge amount of time when catering for hundreds of billions of parameters. A more efficient technique for LLMs is p-tuning (prompt tuning), where a much smaller model is tuned in front of the frozen LLM. P-tuning saves a lot of time and resources as it can be done in a matter of hours rather than months. The results of the p-tuning are saved as virtual tokens in a lookup table for inference, replacing the smaller model.
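The virtual-token idea can be illustrated in a few lines. A toy sketch, assuming the frozen LLM consumes a sequence of embedding vectors; all names, token keys and the 4-dimensional embeddings below are made up for illustration only:

```python
# Toy illustration of how p-tuning's saved virtual tokens are used at inference.
EMB_DIM = 4

# Learned during p-tuning and saved as a lookup table - this stands in for
# the small prompt model, which is discarded after tuning.
virtual_tokens = {
    "vt0": [0.12, -0.30, 0.90, 0.05],
    "vt1": [-0.70, 0.40, 0.10, 0.20],
}

# Tiny stand-in for the frozen LLM's word-embedding table.
word_embeddings = {
    "translate": [0.5, 0.1, -0.2, 0.3],
    "hello": [0.9, -0.1, 0.4, 0.0],
}

def build_inference_input(prompt_words):
    """Prepend the saved virtual-token embeddings to the prompt embeddings."""
    seq = [virtual_tokens[k] for k in sorted(virtual_tokens)]
    seq += [word_embeddings[w] for w in prompt_words]
    return seq

seq = build_inference_input(["translate", "hello"])
print(len(seq))  # 2 virtual tokens + 2 prompt tokens = 4 embedding vectors
```

At inference time the frozen model simply sees a slightly longer embedding sequence; no weights in the large model change.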
Training is much faster with these smaller models and datasets; this allows for iterative re-training of Natural Language Processing (NLP) tasks that evolve over time. P-tuning, whilst being less resource intensive, does still benefit from fast memory bandwidth for tensor offloading. On x86 systems, accessing the necessary system memory is limited by the PCIe link, but with the Grace Hopper superchip the NVLink-C2C gives high-speed access to LPDDR5X memory. This greatly reduces tensor-offloading execution time when p-tuning compared to an x86 + Hopper system, as the example below shows with a GPT-3 175B model.
NVIDIA DGX GH200
NVIDIA also announced their DGX GH200, essentially a reference blueprint for massive scalability for the world's largest HPC and AI workloads. The DGX GH200 features 256 Grace Hopper superchips with a slightly different mixture of memory than previously mentioned: 96GB of HBM3 and 480GB of LPDDR5X per superchip. NVIDIA's 4th Gen NVLink Switch System allows all 144TB of memory to be accessible to the GPUs within the network, making the DGX GH200 the first supercomputer to break past the 100TB barrier for memory accessible to a GPU.
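The 144TB headline figure follows from the per-superchip capacities. A quick check (note the round 144 comes out in binary units, i.e. TiB):

```python
# Aggregate NVLink-addressable memory in a 256-superchip DGX GH200.
superchips = 256
hbm3_gb = 96        # per-GPU HBM3 capacity
lpddr5x_gb = 480    # per-CPU LPDDR5X capacity in the DGX GH200 configuration

total_gb = superchips * (hbm3_gb + lpddr5x_gb)
print(total_gb)           # 147456 GB
print(total_gb / 1024)    # 144.0 - the quoted "144TB", in binary terabytes
```

256 x 576GB = 147,456GB, which divides by 1024 to exactly 144.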
Baseboards housing the Grace Hopper superchips are connected to the NVLink Switch System using a custom cable harness for the 1st NVLink layer fabric. LinkX cables then extend connectivity in the 2nd layer to 36 NVLink Switches - as shown below.
For every Grace Hopper superchip in the DGX GH200 there is also an NVIDIA ConnectX-7 network adapter and an NVIDIA BlueField-3 NIC. For solutions that need to scale beyond 256 GPUs, the ConnectX-7 adapters can interconnect multiple DGX GH200 systems into an even larger supercomputer. The inclusion of BlueField-3 DPUs allows organisations to run applications in secure, multi-tenant environments.
A lot of HPC and AI workloads can fit within the aggregate GPU memory of a single DGX H100, and in such cases the DGX H100 remains the most performant solution.
However, many AI and HPC models require massive memory capacity to house their workloads; this is what the DGX GH200 is purpose-built for and truly excels at. The speedups are demonstrated below.
NVIDIA aims to make the DGX GH200 available at the end of this year, but the HGX Grace Hopper Superchip platform will be available in the coming months here at Boston. Pictured below is Supermicro's ARS-221GL-NR supporting NVIDIA's Grace Hopper Superchip, which we will have available soon. Full system specifications are yet to be confirmed - we will update as soon as possible; please get in touch to register your interest now, as demand is expected to be huge.
Sonam Lama - Senior Field Application Engineer