Intel’s Incredible PIUMA Graph Analytics Hardware

The Intel PIUMA graph-optimized ASIC will focus on many RISC cores and fast random memory access. (from Figure 3 of the paper)

For the last few years, I have been promoting the idea of the Hardware Graph. My assertion was that graph hardware needs to focus on simple pointer hopping at scale, and I have always maintained that the best way to do this is with full-custom ASIC chips and memory designed for the random access patterns of pointer hopping over large memory footprints. Although I had privileged access to some insider knowledge of developments in this field, on October 13th, 2020, Intel published the results of their groundbreaking research on how they are building the next generation of graph hardware. Now that this paper is in the public domain, we can openly discuss the new architecture and its impact on the Enterprise Knowledge Graph (EKG) industry. I hope to convince you that the impact could be huge!

Intel’s name for this new architecture is PIUMA: Programmable Integrated Unified Memory Architecture. Although the word “memory” is prominent in the name, PIUMA is not only about optimizing memory hardware. It goes beyond that to include the design of new RISC cores whose instruction sets are optimized for graph traversal. There are significant performance gains to be had from many lightweight cores with smaller instruction sets. When these cores are combined with memory tuned for graph-access patterns, the goal is a 1,000x speedup over complex instruction set computers (CISC) for graph algorithm execution.

Inspired by the DARPA HIVE Project

The HIVE program is looking to build a graph analytics processor that can process streaming graphs 1,000X faster and at much lower power than current processing technology.

I think that after you read the Intel PIUMA paper, you will agree that they are well on their way to meeting, and in some benchmarks exceeding, these goals!

Before we dive into the predicted performance of this new hardware, a bit of background might be helpful for those of you not familiar with the concepts in graph hardware optimization.

Classification vs Connection

Most of the recent breakthroughs in machine learning have been in classification tasks, for example:

  1. Finding objects in an image (image classification)
  2. Finding entities (nouns) like people, places, and things in text documents (entity extraction)
  3. Finding words in speech (automatic speech recognition)
  4. Finding variations in a genomic sequence

However, once these items are discovered in unstructured data, there must be ways of connecting this information into coherent knowledge graphs. This is really the next step in AI. The recent NIPS conference papers include 136 occurrences of the string “graph” in their titles, so clearly combining machine learning with graph representations of knowledge is a big trend in AI research.

Bottom line: modern AI must both classify and connect in real-time

Dense vs Sparse Analytics Workloads

Unlike classification problems, connection problems tend to be best represented by a graph. We start with an existing knowledge graph of known connections and then use both rules and machine learning to predict new relationships. This is what the semantic web community calls “inference”: the discovery of new relationships within an existing graph. Both deterministic rules (ontologies) and machine learning algorithms work together to perform inference.

Historically, many academics working in “symbolic AI” focused on how to store ontologies in different forms (SKOS, OWL, etc.) and when to execute these rules. Most of the Semantic Web Stack is concerned with this area of research. Sometimes the results of inference take a long time to compute; they can be stored as new materialized edges or recalculated on demand. There are still many interesting topics in inference engines that will impact knowledge graphs in the future. However, they all boil down to graph traversals, which are really just pointer hopping. No one has ever given me an inference rule that could not be converted to a graph query; very often the rules are just a few lines of graph queries.
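
To make that concrete, here is a minimal Python sketch of an ontology-style rule (hasParent twice implies hasGrandparent) executed as a plain two-hop traversal over an adjacency list. The toy graph and the relation names are invented for illustration; a real EKG would run the equivalent graph query inside a graph database.

```python
# Minimal sketch: an inference rule expressed as a two-hop traversal.
# The toy graph and the "hasParent" relation are invented for illustration.
graph = {
    "Ann":   {"hasParent": ["Bob"]},
    "Bob":   {"hasParent": ["Carol"]},
    "Carol": {},
}

def infer_grandparents(g):
    """Rule: hasParent(x, y) AND hasParent(y, z) => hasGrandparent(x, z).
    Each step of the rule is just a pointer hop along an edge list."""
    inferred = []
    for x, edges in g.items():
        for y in edges.get("hasParent", []):          # first hop
            for z in g.get(y, {}).get("hasParent", []):  # second hop
                inferred.append((x, "hasGrandparent", z))
    return inferred

print(infer_grandparents(graph))  # [('Ann', 'hasGrandparent', 'Carol')]
```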

Bottom line: many problems in AI require sparse representations of knowledge

Example: Using GPUs to Store Knowledge Graphs

GPUs are optimized for dense, regular workloads: wide SIMD units streaming through contiguous blocks of memory. A large knowledge graph is the opposite kind of data, a sparse structure whose adjacency lists leave most of those wide execution lanes and memory transactions idle.

Bottom line: GPUs are not designed for graph inference and are inefficient at large knowledge graph representation.

Why CISC Architectures are Inefficient for Graph Traversal

If you look at the actual silicon real estate in many CISC processors, you will see that a huge amount of it is allocated to optimizing the performance of compiled programs. These programs have many IF/THEN/ELSE operations that need to run quickly; each is implemented as a comparison followed by a branch to another location in the code. CISC processors speculate on which branches will be taken and load memory references for the different branches in parallel. If you program in a modern strongly typed language, every time you call a function there is code that checks that the parameters are of the right type and within the right range of values. These checks boil down to IF (ERROR) THEN GOTO ERROR_HANDLER ELSE CONTINUE. As you can guess, going down the ERROR path is extremely rare, so a good compiler (and the CPU’s branch predictor) will not pre-fetch the error path.
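
To make the pattern concrete, here is a small Python sketch of the validate-then-branch shape described above; the function name and its valid range are invented for illustration.

```python
def set_speed(speed_kmh: float) -> None:
    # The comparison below compiles to a test-and-branch. Because the
    # error branch is almost never taken, speculative hardware bets on
    # the fall-through path and pre-fetches only that side.
    if not (0.0 <= speed_kmh <= 300.0):  # rare path
        raise ValueError(f"speed out of range: {speed_kmh}")
    # ... common path continues here ...

set_speed(88.0)  # almost every call falls straight through
```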

Bottom line: most CISC processors today are heavily optimized for executing compiled code, not graph traversal.

The Top Graph Challenges

The PIUMA paper identifies several key challenges that graph workloads present to conventional hardware:

  1. Cache and Bandwidth — optimizing memory access patterns for random pointer hopping.
  2. Irregular Computation and Memory Intensity — creating many small cores that execute the pointer hops and are not constantly waiting for memory.
  3. Synchronization — if a traversal needs to update memory within a core, how do we get that update into global memory efficiently?
  4. Scale — since a multi-trillion-vertex graph will not fit in the RAM of a single server, how can traversals that cross over to other nodes be executed quickly?

Graphs Need Fast Random Memory Access

Linear Memory Scans Can Be Boosted with Caches

As the paper states:

…graph applications suffer from inefficient cache and bandwidth utilization

If you are doing a simple search for a string of text in a long unindexed document, you will simply be stepping through the document page after page. A modern CISC processor will work hard to pre-fetch pages and put them in a cache, so that when the CPU is ready to use a page, the memory access time is fast.

Many relational databases are focused on fetching many rows from tables. These rows are frequently laid out next to each other in memory and grouped together in “pages.” Current CPUs use elaborate caching algorithms to predict which page of memory the CPU will need next. But the complex caching silicon that helps with search and relational databases doesn’t help us with graph traversal.
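
As a rough illustration (not a rigorous benchmark), the Python sketch below contrasts a prefetch-friendly sequential scan with a pointer chase over the same amount of data. Interpreter overhead masks much of the effect in Python, and the gap is far larger in compiled code, but these two access patterns are exactly the ones at issue.

```python
import random
import time

N = 2_000_000

# Sequential scan: element i is visited right after element i - 1,
# so hardware prefetchers can stay ahead of the CPU.
data = list(range(N))
t0 = time.perf_counter()
total = 0
for v in data:
    total += v
t_seq = time.perf_counter() - t0

# Pointer chase: nxt[i] holds the index of the next element to visit,
# in random order, so every hop is an unpredictable memory access --
# the same pattern as following edges in a large graph.
order = list(range(N))
random.shuffle(order)
nxt = [0] * N
for a, b in zip(order, order[1:] + order[:1]):  # one big random cycle
    nxt[a] = b
t0 = time.perf_counter()
i, total = 0, 0
for _ in range(N):
    total += i
    i = nxt[i]
t_chase = time.perf_counter() - t0

print(f"sequential scan: {t_seq:.3f}s, pointer chase: {t_chase:.3f}s")
```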

Bottom line: Graph databases rarely need to do large memory scans. Elaborate caching hardware will not help graph algorithms.

As a side note, converting any low-cardinality attribute of a graph (like PersonGenderCode) into a vertex is a way of using the outbound edges of that vertex like an index. If you need to filter all members by gender, you just traverse the edges from the appropriate gender vertex. That avoids having to create secondary indexes that slow down updates.
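
Here is a minimal Python sketch of that trick, with invented vertex names: each gender code becomes a vertex, and filtering is a single hop along its outbound edges.

```python
# Sketch of the attribute-as-vertex trick. Instead of a secondary index
# on a PersonGenderCode property, each gender value is a vertex whose
# outbound edges act as the index. Names are invented for illustration.
edges_out = {
    "gender:F": ["Ann", "Carol"],
    "gender:M": ["Bob"],
    "Ann": [], "Bob": [], "Carol": [],
}

def members_with_gender(code):
    """Filtering by gender is a single pointer hop from the gender
    vertex -- no table scan, no secondary index to maintain on update."""
    return edges_out[f"gender:{code}"]

print(members_with_gender("F"))  # ['Ann', 'Carol']
```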

The paper goes on to describe the authors’ analysis of graph analytics algorithms and how they optimized the hardware to maximize the performance of native graphs that use direct pointer hops. For now, I am going to defer the analysis of those sections to a later blog post and focus on the performance results of the hardware simulations.

Summarizing Performance Simulations

The paper uses a benchmark kernel called Sparse Matrix Dense Vector Multiplication (SpMV), which is a good proxy for the types of calculations done in PageRank-style algorithms. The data is stored in a format called Compressed Sparse Row (CSR), also known as Yale format. The simulator shows a 10x performance improvement even before taking the memory improvements into consideration. Once the full memory-access simulation is included, we get a 29x performance improvement on a single node. Extrapolating the simulation to a 16-node configuration yields a 467x speedup, and by adding more nodes to the simulation we can see that the DARPA HIVE goal of 1,000x is within reach.
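
For readers unfamiliar with CSR, here is a minimal Python sketch of SpMV over the three CSR arrays, using an invented 3x3 matrix. Note that the inner loop gathers vector entries at scattered positions, which is exactly the random-access pattern PIUMA targets.

```python
# Minimal sketch of SpMV (y = A @ x) over Compressed Sparse Row (CSR)
# storage. The 3x3 example matrix is invented for illustration.
#
#     A = [[10,  0,  0],      values  = non-zeros, row by row
#          [ 0, 20, 30],      indices = column of each non-zero
#          [40,  0, 50]]      indptr  = where each row starts in values
values  = [10, 20, 30, 40, 50]
indices = [0, 1, 2, 0, 2]
indptr  = [0, 1, 3, 5]

def spmv(indptr, indices, values, x):
    n = len(indptr) - 1
    y = [0.0] * n
    for row in range(n):
        # The gather x[indices[k]] jumps around memory -- the sparse,
        # random access that defeats conventional caches.
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += values[k] * x[indices[k]]
    return y

print(spmv(indptr, indices, values, [1.0, 1.0, 1.0]))  # [10.0, 50.0, 90.0]
```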

The paper also provides simulation results for many other graph algorithms. One of the most impressive speedups is for Random Walks, at 2,606x. Random walks are critical for calculating vector representations of graphs called graph embeddings, another topic that has become critical in graph analytics. Although numbers from a hardware simulator and its extrapolation are not exact, they are a good sign that Intel PIUMA is heading in the right direction.
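
For intuition, here is a minimal Python sketch of the uniform random walks used by embedding methods such as DeepWalk; the toy graph and walk length are invented for illustration.

```python
import random

# Minimal sketch of the uniform random walks that feed graph embedding
# training (as in DeepWalk/node2vec). Every step is one pointer hop to
# a randomly chosen neighbor -- which is why this workload rewards
# hardware built for fast random memory access.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def random_walk(adj, start, length):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(adj[walk[-1]]))  # one pointer hop
    return walk

print(random_walk(adj, "a", 5))  # e.g. ['a', 'c', 'd', 'c', 'b']
```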

Bottom line: Custom silicon could give graph analytics a 1,000x speed improvement over the existing hardware.

Next Steps and The Future

All the contenders building this new class of hardware will be pressed to provide not just low-level C-language interfaces; they will also need to support the new GQL standard so that graph database vendors can build solutions on top of their hardware. There is still much work to be done before it is easy to tap the 1,000x improvements that graph-optimized architectures can offer our Enterprise Knowledge Graphs.

Apology and Gratitude

I want to close by expressing my gratitude to both the people at DARPA who sponsored the HIVE challenge and the incredible team at Intel that is designing and building the new hardware. You may not have started your project scope with “let’s lower the cost of healthcare for everyone,” but I truly believe that this will be one of the “unintended consequences” of your leadership. So thank you, and good luck!

