NVIDIA Enters the CPU Market
How a company known for GPUs could help build better EKG hardware
On March 22, NVIDIA announced that it is entering the CPU market with a new 144-core chip with high-speed interfaces. This blog post will discuss the implications for the Enterprise Knowledge Graph (EKG) industry.
In the past, I have written extensively about next-generation Hardware Optimized Graph (HOG) processors and how they are making older relational databases obsolete. My HOG analysis shows that distributed EKGs running on hardware customized for fast pointer-hopping operations can be eight orders of magnitude faster than JOIN operations on relational databases. The new NVIDIA Grace CPUs could make NVIDIA a key player in the disruptive EKG industry. But it is still too early to tell how far NVIDIA will go in the future EKG-optimized hardware space. Let’s see why.
NVIDIA Grace CPU Specifications
Each Grace core is based on the ARMv9 RISC architecture. Seeing NVIDIA use ARM cores indicates that the two companies are still working closely together despite NVIDIA's failed attempt to acquire ARM, which I wrote about in September 2020. The deal was abandoned because of regulatory concerns.
The Grace CPU chip could be an important development for the graph database industry. For graph traversal performance, core count and I/O rates matter more than compatibility with the 1,500+ instructions in older Intel CISC processors. Since backward compatibility is not a design consideration for NVIDIA, they are free to look at some green-field ideas. Their selection of ARM RISC cores was a good design decision. RISC processors use a smaller, simpler instruction set, which can yield dramatic performance improvements, at the cost of recompiling existing software.
The new CPU is called the Grace Superchip. It is named after the computer pioneer Grace Hopper. Hopper was the first to promote machine-independent programming languages such as COBOL. Her groundbreaking work allowed programmers to move their programs between different computer systems and avoid vendor lock-in. For example, with high-level languages, we can move our code from CISC to RISC by just recompiling our software. We need to remember that at the time, most computer software was written in assembly language that was tied directly to the instruction set of the computer manufacturer. It is fitting that Hopper’s work on breaking vendor lock-in also applies to building EKG-optimized hardware.
Background on the ARMv9 Cores
ARM is a UK-based company that provides intellectual property (IP) for individual cores. ARM has been working on the instruction set and compilers for over 20 years and has a long lead over most other RISC core systems. ARM cores are now used widely in Apple's Mac chips, the Raspberry Pi, the NVIDIA Jetson Nano, and high-performance computing (HPC) systems. In short, ARM cores are now ubiquitous across the computing industry. It makes perfect sense for NVIDIA to select ARMv9 cores for their first CPU chips.
Because NVIDIA has used ARM cores in their Jetson embedded edge processors, they already have a strong working relationship with ARM going back to 2018. When I co-founded the AI Racing League, we used Jetson Nano processors in almost all of our events. They had excellent integration of the ARM cores and the NVIDIA GPUs. So NVIDIA has plenty of experience building tools that compile their libraries to work with the ARM instruction sets. We should note that the ARM cores in the Jetson Nano were Cortex-A57 MPCores, and that these boards were used for neural network inference, not for training deep-learning models.
NVLink: Preventing Memory Starvation
Although putting 144 cores on each chip is a challenging engineering effort, many graph algorithms will not show a significant improvement if 90% of the cores are sitting idle waiting for the next chunk of memory. To get around this memory starvation problem, chip designers need specialized hardware and software that move data in and out of the cores quickly and without blocking.
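To put a rough number on the starvation problem, here is a minimal back-of-the-envelope sketch. It assumes every core stalls for the same fraction of its time, which is a simplification; the function name effective_cores is my own for illustration and is not taken from any NVIDIA tool.

```python
def effective_cores(total_cores: int, stall_fraction: float) -> float:
    """Average number of cores doing useful work at any moment.

    Simplifying assumption: every core spends `stall_fraction` of its
    time idle, waiting on memory.
    """
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0 and 1")
    return total_cores * (1.0 - stall_fraction)


if __name__ == "__main__":
    # Compare an ideal chip with two increasingly memory-starved ones.
    for stall in (0.0, 0.5, 0.9):
        busy = effective_cores(144, stall)
        print(f"stall fraction {stall:.0%}: about {busy:.1f} busy cores")
```

Under this simple model, a 144-core chip whose cores wait on memory 90% of the time does the work of roughly 14 fully fed cores, which is why the interconnect matters as much as the core count.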
NVIDIA’s answer to this is their NVLink system.
Initially introduced by NVIDIA in 2014, NVLink is now in its fourth generation. It was originally designed as a faster alternative to the PCI-Express standard, permitting fast point-to-point communication directly between GPUs. Each generation of NVLink has added new features such as shared memory, making it easier for developers to quickly send data between CPUs and servers without going up and down the TCP/IP protocol stacks. Now it is also being used to communicate from CPU to GPU. Although NVLink is not widely used outside of NVIDIA, its integration directly into Grace could give NVIDIA some advantages.
If your EKG use cases focus on singleton-type queries (return all the information on customer X), these mostly embarrassingly parallel queries might not require fast I/O hardware. EKG architects need to understand the capabilities of NVLink and be able to do a cost analysis to see if these hardware-based networking devices are a good investment.
Understand Query Data Access Patterns
Most general EKGs need a mix of singleton and cross-population queries. The latency and bandwidth questions are critical for a good hardware platform, as illustrated in the figure below.
By Singleton Queries, we mean queries that access a minimal subset of the EKG, such as all the information about a single customer, patient, or product. All the vertices about a single customer (demographics, purchases, touchpoints, etc.) might be located on a single graph data server. These queries don’t typically need to move data between the servers that store the data. They are typical of Customer 360-driven knowledge graphs.
By Population Queries, we mean queries that require a high degree of low-latency communication between graph data servers. For example, when you are looking for complex patterns in your graph, the queries need to span different parts of the graph. Graph queries such as random walks for creating graph embeddings frequently cross server boundaries. Without strong CPU-to-CPU networking, these queries will bog the servers down in mundane tasks like moving data in and out of TCP/IP protocol stacks.
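The difference between the two access patterns can be sketched with a toy partitioned graph. Everything here is hypothetical and for illustration only: the vertex names, the placement of vertices on three servers, and the helper functions are assumptions of mine, not part of any graph database API. The sketch simply counts how many hops in a traversal cross a server boundary.

```python
import random

# Hypothetical toy graph: vertex -> list of neighbor vertices.
GRAPH = {
    "cust_1": ["order_1", "order_2"],
    "order_1": ["cust_1", "prod_A"],
    "order_2": ["cust_1", "prod_B"],
    "cust_2": ["order_3"],
    "order_3": ["cust_2", "prod_A"],
    "prod_A": ["order_1", "order_3"],
    "prod_B": ["order_2"],
}

# Hypothetical placement: each customer's data lives together on one
# server, and the shared product catalog lives on a third server.
SERVER_OF = {
    "cust_1": 0, "order_1": 0, "order_2": 0,
    "cust_2": 1, "order_3": 1,
    "prod_A": 2, "prod_B": 2,
}


def count_cross_server_hops(path):
    """Count hops in a traversal path whose endpoints sit on different servers."""
    return sum(1 for a, b in zip(path, path[1:]) if SERVER_OF[a] != SERVER_OF[b])


def singleton_query(customer):
    """Visit one customer and the orders attached directly to it."""
    path = [customer]
    for order in GRAPH[customer]:
        path += [order, customer]  # hop out to the order, then hop back
    return path


def random_walk(start, steps, seed=0):
    """Follow randomly chosen edges, as an embedding-style walk would."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(rng.choice(GRAPH[path[-1]]))
    return path


if __name__ == "__main__":
    single = singleton_query("cust_1")
    walk = random_walk("cust_1", steps=20)
    print("singleton cross-server hops:", count_cross_server_hops(single))
    print("random-walk cross-server hops:", count_cross_server_hops(walk))
```

With this placement the singleton query never leaves server 0, while any walk that passes through a shared product has to cross a server boundary. A pattern such as cust_1, order_1, prod_A, order_3, cust_2 (who else bought what this customer bought?) crosses servers twice in only four hops. Real placements and query mixes will differ, so treat the counts as an illustration of the shape of the problem, not as a benchmark.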
Does NVIDIA Understand the Strategic Nature of EKGs?
Many of my readers know that I firmly believe that machine learning alone will not be sufficient to build systems that have true artificial general intelligence (AGI). EKGs with AGI will be used to interact with our valued customers cost-effectively. Hundreds of papers are being published annually about combining machine learning with graph-based symbolic processing, which offers a high degree of semantics and explainability. Combining ML with large knowledge graphs is the future. And as I quoted Jeff Hawkins in my last blog post, representation is always the hardest part of AI.
I believe that the strategic planners at NVIDIA now understand this. They need to be more than just a SIMD-focused GPU company. The following slide from the recent keynote shows that graph knowledge representations will play a key role alongside their traditional ML tools.
The new Grace CPU is only the first step on NVIDIA's journey to AGI. Eventually, the cores will be optimized to focus on graph traversal, and we will get more than 144 cores per chip. Although I can clearly see NVIDIA’s direction, I can’t give you the dates. We don’t want to confuse a clear view with a short distance.
Summary: Too Soon to See if Grace is Cost-Effective EKG Hardware?
So the question to ask is, can the Grace CPUs be used to accelerate graph queries? The answer is both yes and no.
Yes, in that graph workloads need lots of cores and fast memory paths, and Grace has both. But the other answer is still no. The ARMv9 cores are not yet customized to do pointer hopping, so we still have a lot of wasted silicon real estate. They are general-purpose cores built for a wide variety of processing. As a result, we can’t say that Grace has been customized for graph workloads.
However, for NVIDIA’s first entry into the CPU market, I think that the specifications for Grace are halfway there. They meet the criteria for higher core counts than older CISC architectures, and they support high-bandwidth, low-latency CPU-to-CPU communication.
Two questions remain:
- Will future Grace processors be optimized for graph traversal?
- Will Grace CPUs be priced competitively with future chip architectures from Intel and RISC-V optimized for graph traversal?
What is hard to predict is how expensive Grace CPUs will be. Consumer-grade CPU chips that sell in high volumes can be priced at a few hundred dollars each. But the Grace CPUs are designed with data centers in mind. Chips sold to data centers often imply lower initial sales volumes and higher per-chip prices. It can take years for data center software to be updated to use new chip designs. Data centers need proven cost-savings data before they begin volume purchases.
The other question is, when will pricing be available? Any chip of this complexity might take a few hardware iterations to get the bugs worked out and discover the memory bottlenecks. So it is somewhat hard to predict when we will be able to get evaluation units for benchmarking. NVIDIA indicated that the Grace CPU will be available for purchase sometime in 2023. It is not clear how optimized this version will be.
Regardless of the answers to these questions, I am happy to see more competition in the high-core-count, high-bandwidth CPU marketplace. It can only get better from here!