Intel’s Incredible PIUMA Graph Analytics Hardware
For the last few years, I have been promoting the idea of the Hardware Graph. My assertion has been that graph hardware needs to focus on simple pointer hopping at scale, and I have always stated that the best way to do this is to use full-custom ASIC chips and memory designed for the random access that pointer hopping over large memory footprints requires. Although I have had privileged access to some insider knowledge of developments in this field, on October 13th, 2020, Intel published the results of their groundbreaking research on how they are building the next generation of graph hardware. Now that this paper is in the public domain, we can openly discuss this new architecture and its impact on the Enterprise Knowledge Graph (EKG) industry. I hope to convince you the impact could be huge!
Intel’s name for this new architecture is PIUMA: Programmable Integrated Unified Memory Architecture. Although the word “memory” is prominent in the title, it is not only about optimizing memory hardware. It goes beyond that to include the design of new RISC cores whose instruction sets are optimized for graph traversal. There are many performance gains to be had from lightweight cores with smaller instruction sets. When combined with memory tuned for graph-access patterns, the goal is a 1,000x speedup over conventional complex instruction set computer (CISC) processors for graph algorithm execution.
Inspired by the DARPA HIVE Project
Most enterprise knowledge graph companies are focused on getting the most out of conventional commodity hardware configured with high core counts, terabytes of RAM, and a high-bandwidth, low-latency network interconnecting the servers in a cluster. Because the market for enterprise knowledge graphs is so new, these companies don’t have the budget to build their own full-custom ASICs to optimize their algorithms. But with the prompting of the DARPA HIVE program, Intel does have the resources and expertise to design and build graph-optimized hardware. And to be clear, the goals of the DARPA HIVE project are impressive:
The HIVE program is looking to build a graph analytics processor that can process streaming graphs 1,000X faster and at much lower power than current processing technology.
I think after you read the Intel PIUMA paper, you will agree that they are well on their way to meeting, and in some benchmarks exceeding, these goals!
Before we dive into the predicted performance of this new hardware, a bit of background might be helpful for those of you not familiar with the concepts in graph hardware optimization.
Classification vs Connection
The paper points out that many of the recent developments in AI and machine learning are focused on the identification of objects in unstructured data. This is referred to as object classification. This includes examples like:
- Finding objects in an image (image classification)
- Finding entities (nouns) like people, places, and things in text documents (entity extraction)
- Finding words in speech (automatic speech recognition)
- Finding variations in a genomic sequence
However, once these items are discovered in unstructured data, there must be ways of connecting this information into coherent knowledge graphs. This is really the next step in AI. The papers at the most recent NIPS conference contain 136 occurrences of the string “graph” in their titles. So clearly, combining machine learning with graph representations of knowledge is a big trend in AI research.
Bottom line: modern AI must both classify and connect in real-time
Dense vs Sparse Analytics Workloads
The Intel PIUMA paper mentions that this new hardware is optimized for “sparse” workloads as opposed to “dense” workloads, but it does not go into much detail about where these workloads come from and how they differ. In general, classification problems are often solved using dense matrix representations of unstructured data. By dense, we mean that most of the values in the matrix are non-zero. For example, converting an image into a matrix means that nearly every point in the image carries a non-zero grayscale or color value.
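To make “dense” concrete, here is a minimal sketch (my own illustration, not from the paper) of a tiny grayscale image stored as a NumPy matrix in which essentially every cell carries a meaningful, non-zero value:

```python
import numpy as np

# A tiny 4x4 grayscale "image": every cell holds a pixel intensity (0-255),
# so essentially every value in the matrix is non-zero and meaningful.
image = np.array([
    [ 12,  87, 203, 254],
    [ 45, 190, 222,  17],
    [ 99, 140,  33, 210],
    [  8,  76, 181,  64],
])

density = np.count_nonzero(image) / image.size
print(f"{density:.0%} of the image matrix is non-zero")  # 100%
```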
Unlike classification problems, connection problems tend to be best represented by a graph. We start with an existing knowledge graph of known connections and then use both rules and machine learning to predict new relationships. This is what the semantic web community calls “inference”: the discovery of new relationships within an existing graph. There are both deterministic rules (ontologies) and machine learning algorithms that work together to perform inference.
Historically, many academics working in “symbolic AI” focused on how to store ontologies in different forms (SKOS, OWL, etc.) and when to execute these rules. Most of the Semantic Web Stack is concerned with this area of research. Sometimes the results of inference take a long time to compute and can be stored as new materialized edges or recalculated on demand. There are still many interesting topics in inference engines that will impact knowledge graphs in the future. However, they all boil down to graph traversals, which are really just pointer hopping. No one has ever given me an inference rule that could not be converted to a graph query. Very often the rules are just a few lines of graph queries.
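To make that last claim concrete, here is a toy sketch of an ontology-style rule expressed as a two-hop traversal over an in-memory adjacency list. This is my own illustration, not code from any real inference engine, and the vertex names and edge labels are invented:

```python
# Toy knowledge graph: each vertex maps to a list of (edge_label, target) pairs.
graph = {
    "alice":  [("worksFor", "acme")],
    "acme":   [("locatedIn", "berlin")],
    "berlin": [],
}

# Inference rule expressed as a two-hop traversal:
#   IF ?person worksFor ?org AND ?org locatedIn ?city
#   THEN ?person basedIn ?city   (a new, materialized edge)
def infer_based_in(graph):
    new_edges = []
    for person, edges in graph.items():
        for label, org in edges:
            if label != "worksFor":
                continue
            for label2, city in graph.get(org, []):
                if label2 == "locatedIn":
                    new_edges.append((person, "basedIn", city))
    return new_edges

print(infer_based_in(graph))  # [('alice', 'basedIn', 'berlin')]
```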
Bottom line: many problems in AI require sparse representations of knowledge
Example: Using GPUs to Store Knowledge Graphs
GPU hardware is best suited for transforming dense information. The native data structure of the GPU is a matrix of numerical values. If most of the data in a matrix is non-zero, you can transform it efficiently with a GPU. GPUs are designed with many small cores that work together on matrix transforms. Yet many forms of knowledge don’t fit efficiently in a matrix. For example, to represent a graph in a matrix we use a data structure called an adjacency matrix. Imagine a matrix where each vertex has a row and a column. If two vertices are connected, we put a “1” in the cell at that row and column. If not, we put a “0” in that cell. For a typical knowledge graph with a million vertices, we might have an average of five connections per vertex. That means only 0.0005% of the matrix has non-zero values. This is an inefficient way to store knowledge. For small graphs of a few hundred vertices, using an adjacency matrix representation is not a big problem. But the larger your graph, the more inefficient a GPU becomes.
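Here is the back-of-the-envelope arithmetic behind that 0.0005% figure, assuming the one-million-vertex, five-edges-per-vertex graph described above:

```python
# Sparsity of an adjacency matrix for a typical knowledge graph:
# one million vertices with an average of five edges per vertex.
num_vertices = 1_000_000
num_edges = 5 * num_vertices              # about 5 million "1" cells
num_cells = num_vertices * num_vertices   # 10^12 cells in the full matrix

fraction_nonzero = num_edges / num_cells
print(f"{fraction_nonzero:.7%} of the matrix is non-zero")  # 0.0005000%
```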
Bottom line: GPUs are not designed for graph inference and are inefficient at large knowledge graph representation.
Why CISC Architectures are Inefficient for Graph Traversal
Before we go into the Intel PIUMA architecture, there are a few items we need to cover to give readers a better understanding of why CISC processors are not well suited to the efficient pointer hopping needed in modern native property graphs.
If you take a look at the actual silicon real estate in many CISC processors, you see they allocate a huge amount of silicon to optimizing the performance of compiled programs. These programs have many IF/THEN/ELSE operations that need to run quickly. They are implemented as a comparison followed by a branch to other locations in code. CISC processors try to speculate on which branches will be executed and load memory references for different branches in parallel. If you are programming in a modern strongly typed language, every time you call a function there is code that checks that the parameters are of the right type and in the right range of values. These checks boil down to IF (ERROR) THEN GOTO ERROR_HANDLER ELSE CONTINUE. As you can guess, going down the ERROR path is extremely rare, so a good compiler will not pre-fetch the error path.
Bottom line: most CISC processors today are heavily optimized for executing compiled code, not graph traversal.
The Top Graph Challenges
Intel clearly has done their homework. They have done a detailed analysis of many of the most common graph algorithms like PageRank and looked at how new hardware can be optimized to make execution fast. They summarize the top challenges to graph traversal via pointer chasing to random memory locations.
- Cache and Bandwidth — optimizing memory access patterns for random pointer hopping.
- Irregular Computation and Memory Intensity — creating many small cores that execute the pointer hops and are not constantly waiting for memory.
- Synchronization — if a traversal needs to update memory within a core, how to get that update into global memory efficiently.
- Scale — since a multi-trillion-vertex graph will not fit in the RAM of a single server, how traversals that cross over to other nodes can be executed quickly.
Graphs Need Fast Random Memory Access
Graph algorithms don’t need the hardware complexity required to optimize many diverse memory access patterns. A graph query simply moves the program counter to new addresses in memory and then follows simple rules about which paths to take next based on the properties of the vertex or the edges. To be efficient, graph databases need to quickly assign a traversal to a thread of execution on any available core and let the cores skip through memory. Every transistor that doesn’t contribute to pointer hopping gets in the way. The challenge is that this memory access pattern is hard to predict, since graph traversal really looks like continuous access to random addresses. So great graph hardware needs to be optimized for this random-access pattern rather than for the predictable patterns conventional processors expect.
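To see why the access pattern looks random, here is a minimal breadth-first traversal over an in-memory adjacency-list graph. It is my own sketch, not PIUMA code; the point is that every hop to a neighbor list lands at an effectively unpredictable address:

```python
from collections import deque

# Minimal breadth-first traversal over an adjacency-list graph.
# Each hop to a neighbor list is a lookup at an effectively random
# memory location, the access pattern that cache prefetchers cannot predict.
def bfs(graph, start):
    visited = {start}
    frontier = deque([start])
    order = []
    while frontier:
        vertex = frontier.popleft()
        order.append(vertex)
        for neighbor in graph.get(vertex, []):  # pointer hop to the neighbor list
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
    return order

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs(graph, "a"))  # ['a', 'b', 'c', 'd']
```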
Linear Memory Scans Can Be Boosted with Caches
Imagine that you are searching your hard drive for a file with a specific keyword. Your computer needs to bring each document into RAM and search the RAM for a string. These are called “memory scans,” and they are very common in unindexed database searches.
As the paper states:
…graph applications suffer from inefficient cache and bandwidth utilization
If you are doing a simple search for a string of text in a long unindexed document, you will be simply stepping through the document page after page. A modern CISC processor will work hard to pre-fetch pages and put them in a cache so when the CPU is ready to use that page, the memory access time will be fast.
Many relational databases are focused on fetching many rows from tables. These rows are frequently laid out in memory next to each other and grouped together in “pages” of memory. Current CPUs use elaborate caching algorithms to predict which page of memory the CPU will need next. So the complex caching silicon that helps with scans and relational databases doesn’t help us with graph traversal.
Bottom line: Graph databases rarely need to do large memory scans. Elaborate caching hardware will not help graph algorithms.
As a side note, converting any low-cardinality attribute of a graph (like PersonGenderCode) into a vertex is a way of using the outbound edges of that vertex like an index. If you need to filter all members by gender, you just traverse the edges from the vertex for the right gender code. That avoids having to create secondary indexes that slow down updates.
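Here is a tiny sketch of that modeling pattern (the person identifiers are invented for illustration); the gender-code vertex effectively plays the role of the index:

```python
# Modeling a low-cardinality attribute (PersonGenderCode) as a vertex:
# instead of maintaining a secondary index on a property, each code vertex
# simply holds outbound edges to every person with that code.
gender_code_vertices = {
    "F": ["person-101", "person-204", "person-311"],
    "M": ["person-102", "person-207"],
}

# "Filter all members by gender" becomes a single edge traversal,
# with no separate index to update on every write.
print(gender_code_vertices["F"])  # ['person-101', 'person-204', 'person-311']
```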
The paper goes on to describe the results of its analysis of graph analytics algorithms and how the hardware is being optimized to maximize the performance of native graphs that use direct pointer hops. For now, I am going to defer the analysis of these sections to a later blog post and focus on the performance results of the hardware simulations.
Summarizing Performance Simulations
The paper concludes with a comparison of a PIUMA node with a current Xeon Gold 6140 system with four sockets and 18 cores per socket. The PIUMA node will have 256 PIUMA “blocks,” and each block will have a number of both multi-threaded and single-threaded cores. Doing an exact comparison of thread counts may not be very meaningful since the cores are very specialized, but in general we are going from 144 threads on the Xeon node to 16K threads on the PIUMA node. The key to the performance speedups they are predicting from their hardware simulations is not just high thread counts, but the combination of high thread counts with enhanced memory and network hardware.
The paper uses a benchmark called Sparse Matrix Dense Vector multiplication (SpMV), which is a good approximation of the types of calculations done in PageRank algorithms. The data is stored in a format called Compressed Sparse Row (CSR), also known as the Yale format. The simulator shows a 10x performance improvement even before taking the memory improvements into consideration. Once all the memory access simulation is done, we get a 29x performance improvement on a single node. Then, by extrapolating the simulation to a 16-node configuration, we see a 467x speedup. By adding more nodes to the simulation, we can see that getting to the DARPA HIVE goal of 1,000x is within reach.
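For readers unfamiliar with SpMV, here is a minimal sketch of the kernel using SciPy’s CSR format. This is my own illustration of the operation being benchmarked, not the PIUMA benchmark code:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny sparse adjacency-style matrix in Compressed Sparse Row (CSR / Yale) form.
dense = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)
A = csr_matrix(dense)  # stores only the non-zero values plus row/column index arrays
x = np.array([0.25, 0.25, 0.25, 0.25])

# Sparse Matrix Dense Vector multiplication (SpMV): the core operation
# inside each PageRank iteration.
y = A @ x
print(y)  # [0.25 0.5  0.25 0.25]
```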
The paper also provides simulation results for many other graph algorithms. One of the most impressive speedups is for Random Walks at 2,606x. Random walks are critical for computing graph embeddings, another technique that has become critical in graph analytics. Although the hardware simulations and extrapolations are not exact numbers, they are a good sign that Intel PIUMA is heading in the right direction.
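For context, here is a minimal sketch of a single uniform random walk over an adjacency-list graph. It is my own illustration; real embedding pipelines generate millions of such walks and feed them into the embedding training step:

```python
import random

# One uniform random walk over an adjacency-list graph. Corpora of such
# walks are the raw material for graph-embedding methods.
def random_walk(graph, start, length, seed=None):
    rng = random.Random(seed)
    walk = [start]
    current = start
    for _ in range(length):
        neighbors = graph.get(current, [])
        if not neighbors:  # dead end: stop the walk early
            break
        current = rng.choice(neighbors)
        walk.append(current)
    return walk

graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["a"]}
print(random_walk(graph, "a", length=5, seed=42))
```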
Bottom line: Custom silicon could give graph analytics a 1,000x speed improvement over the existing hardware.
Next Steps and The Future
As the paper mentions, Intel is still early in the design and fabrication of the new PIUMA architecture. Getting the silicon working, developing all the software, and porting graph databases to use the new hardware will take time. However, I think that any hardware company in the AI space will need to compete against Intel PIUMA to be a contender. The contenders in the graph hardware space include Graphcore, the NVIDIA/ARM merger, and now AMD/Xilinx after their merger closes next year. All of these firms have teams of engineers that appreciate the need for non-matrix parallel computation. Each will need to invest hundreds of millions of dollars to build full-custom ASICs optimized for pointer hopping. This new hardware will compete with existing high-performance computing offerings like the Cray Graph Engine (now part of HP Enterprise) but cost a tiny fraction of what those systems do.
All these contenders will be pressed to provide not just low-level C-language interfaces; they will also need to leverage the new GQL standard to implement solutions with graph database vendors. There is still much work to be done before it is easy to take advantage of the 1,000x improvements that graph-optimized architectures can offer our Enterprise Knowledge Graphs.
Apology and Gratitude
I first want to mention that I should have gotten this blog post out several weeks ago when the paper came out. I was in the middle of rolling out a new graph training curriculum and got quite behind on my blogging efforts.
I want to close by expressing my gratitude to both the people at DARPA who sponsored the HIVE challenge and the incredible team at Intel that is designing and building the new hardware. You may not have started your project with the goal of “let’s lower the cost of healthcare for everyone,” but I truly believe that this will be one of the “unintended consequences” of your leadership. So thank you and good luck!