From Lakes to Hubs to Graph

Dan McCreary
10 min read · Aug 25, 2019

How AI is evolving towards custom graph hardware

The evolution of large-scale big data technologies driving AI. We are building on knowledge gained from single-node graphs, Data Lakes, Data Hubs, and Enterprise Knowledge Graphs. In the future, we will see low-cost extensible custom VLSI graph hardware dedicated to solving large graph problems.

Last month I posted a short note on my LinkedIn account on the unfortunate decline of MapR, a company that had brilliant engineers trying to find a place in the crowded data products market. On the bright side, I was happy to see that my post generated a lot of discussion. It made many people think more about where the “Big Data” industry is going and how this will impact the large-scale enterprise analytics behind many of the innovations in AI. So here are my predictions on how the change in data-at-scale is driving the evolution of AI.

Before we talk about the future of data integration patterns, let’s recap three Big Data architectural patterns and how they differ. These are the Data Lake, the Data Hub, and the most recent addition, the Enterprise Knowledge Graph (EKG).

The Data Lake Pattern

A Data Lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. In large organizations, Data Lakes were used to store nightly raw table-by-table dumps of data from operational systems. The dumps were often simple flat CSV files with many numeric codes and low semantics. The theory was to get the raw data out of the fragile RDBMS systems so that long analytical queries could be off-loaded and the expensive RDBMS licenses could be reserved for daytime transactions. The Data Lake was purported to be this magical place where Data Scientists and AI researchers could get their data to build high-quality predictive models. And it was indeed much more flexible than the rigid star-schema patterns that were required for analytical OLAP cubes. After all, anyone could add new data at any time to a Data Lake and you didn’t need to consult the data modeling team that had a six-month backlog of work and only believed in dimensional modeling as taught by Ralph Kimball.

Data Lakes had a huge spike of inflated expectations around 2014 when companies like Intel invested almost $700M in Cloudera (now merged with Hortonworks and still struggling to give their investors a good ROI). MapR represented one of the two companies still standing that promoted the use of Hadoop as a way to integrate all of a company’s data in a single large distributed file system called HDFS (Hadoop Distributed File System). HDFS was a good fit for about 3% of the use cases in an organization since it was easy to scale and it was one of the first big breaks from legacy RDBMS systems. However, without support for features like ACID transactions, document search, caching, semantics and role-based access control (RBAC), it really was not a good fit for the other 97% of company use cases. There were many NoSQL products that were superior to HDFS for scale-out enterprise-level database tasks.

The Data Hub Pattern

In response to the poor fit of HDFS to the needs of large organizations for a single view of the customer, another pattern evolved around 2015. This was the Data Hub pattern. A Data Hub is a collection of data from multiple sources organized for distribution, sharing, and often subsetting.

The Data Hub pattern was championed not by a simple key-value file system store, but by organizations promoting document stores. Companies like MongoDB, Couchbase and MarkLogic championed the use of document stores as a single place where all the information about a single customer could be stored in a single aggregated machine-readable document, usually in JSON or XML. Companies like MarkLogic really moved the ball forward in creating integrated views of customers. Not only did they have massive scale-out and high availability driven by a true peer-to-peer cluster, but they also provided excellent distributed ACID transactions, search, caching, semantics and document-level role-based access control.

MarkLogic also provided features like built-in data models where you could optionally validate incoming data against a data model (like an XML Schema) and assign data quality metrics to each document. You could also query the data model and generate UI elements like pick lists from enumerated-field data elements. The ability to query your model promoted model-driven design and made it easier to build applications. These technologies were part of the XRX stack that I championed back in 2007. Every advanced AI architecture should be able to ask itself questions about its own data model and its own data quality. I should be able to ask the database, “what can you tell me about customer satisfaction surveys” and it should be able to show the types of records related to these surveys and counts.

In many ways, MarkLogic and their peers like eXist-db really were way ahead of their time when it came to model management. Validate was a keyword embedded directly in the XQuery language. Tools like oXygen XML could generate a precise validation model file from any crazy-messy folder of XML data. A data quality ingest pipeline could be built in hours, and searches and reports could be limited to documents with data quality scores above any given threshold. Many organizations that struggle today with simple model discovery and data quality in their “free” open source databases should consider making built-in model discovery, model management and model query part of their future roadmaps so they can eventually get to where these firms were back in 2007. Assigning a single data quality score, like a number from 1 to 100, to every document about your customer is critical for getting accurate data quality reports for every batch of data you add to your systems. We learned a great deal about the power of declarative languages like XML Schema and SHACL for managing quality and I hope we can transfer that knowledge in the future.
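To make the idea of a per-document quality score concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `CUSTOMER_MODEL` rules, the `quality_score` and `above_threshold` functions, and the sample documents are all hypothetical, and none of this is a MarkLogic, XQuery or XML Schema API. The score here is simply the percentage of model rules a document passes, so it runs from 0 to 100.

```python
# Hypothetical data model: field name -> validation rule (an assumption for illustration).
CUSTOMER_MODEL = {
    "id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
    "state": lambda v: v in {"MN", "WI", "IA"},  # enumerated field, like a pick list
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
}

def quality_score(doc, model=CUSTOMER_MODEL):
    """Return a 0-100 score: the percentage of model rules the document passes."""
    passed = sum(1 for field, rule in model.items()
                 if field in doc and rule(doc[field]))
    return round(100 * passed / len(model))

def above_threshold(docs, threshold):
    """Keep only documents whose quality score meets the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    {"id": 1, "email": "ann@example.com", "state": "MN", "age": 41},
    {"id": 2, "email": "not-an-email", "state": "MN", "age": 35},
    {"id": -3, "email": "bob@example.com", "state": "TX"},
]
scores = [quality_score(d) for d in docs]
trusted = above_threshold(docs, 70)  # reports could be limited to these documents
```

Because the rules live in a data structure rather than in application code, the same model can be queried, for example to list the allowed values of an enumerated field.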

The Enterprise Knowledge Graph

Back in January of 2018, I noted that the demands of providing a state-of-the-art AI-driven system triggered a new integration pattern: The Enterprise Knowledge Graph (EKG). When I wrote that article I was sure graphs were going to dominate in the future, but I was not sure how quickly the scale-out graph technology would evolve. I always try to remember the famous Paul Saffo quote: “Never mistake a clear view for a short distance”. It is far easier to predict direction than duration in time.

Why are graph databases superior to document stores? In a word: algorithms. We now have a rich library of algorithms that quickly traverse large knowledge graphs to find insights in real-time. We are in the age of HTAP, Hybrid Transactional/Analytical Processing, where what used to be considered an overnight analytics job can now be done in seconds. Rules engines, recommendation engines and explainable prediction engines are all feasible in real-time with these new algorithms. You can expect each core in your server today to be able to evaluate two million edge traversals per second. That means if you have 64 cores on each server, these algorithms can be doing 128 million rule evaluations per second. Graph algorithms also complement (but don’t replace) complex inference systems generated from deep learning models. These inference rules get easier to create each year, but they still have low explainability. Knowledge graph traversals can explain why specific paths have been traversed.
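The explainability point can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular graph database works: the `GRAPH` dictionary, its vertex and edge names, and the `explain_path` function are all made up for illustration. A breadth-first traversal returns the labeled edges it followed, and that path is the explanation.

```python
from collections import deque

# Tiny hypothetical knowledge graph: vertex -> list of (edge_label, target_vertex).
GRAPH = {
    "customer:ann": [("PURCHASED", "product:tent")],
    "product:tent": [("IN_CATEGORY", "category:camping")],
    "category:camping": [("INCLUDES", "product:stove"), ("INCLUDES", "product:tent")],
    "product:stove": [],
}

def explain_path(graph, start, goal):
    """Breadth-first search that returns the labeled edges linking start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        vertex, path = queue.popleft()
        if vertex == goal:
            return path
        for label, target in graph.get(vertex, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(vertex, label, target)]))
    return None  # no path, so nothing to explain

path = explain_path(GRAPH, "customer:ann", "product:stove")
for source, label, target in path:
    print(f"{source} -[{label}]-> {target}")
```

The printed chain of edges is the kind of answer a deep learning model cannot easily give when asked why it recommended the stove.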

If you have not been on teams that are leveraging graph algorithms, you are missing out. Doing things like clustering data, finding influencers, looking for anomalies and looking for similar customers means you just run a few functions on your graph. No need to hire a data scientist and export your data into Python-friendly data frames. No need to buy racks of GPUs. The analysis can often be done in real-time within the graph and the results of this analysis are used to update graph properties for future queries.
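As a rough sketch of the "similar customers" case, the Python below scores customers by the overlap of their purchases (Jaccard similarity) and writes the result back as a vertex property. The `PURCHASES` data and the `jaccard` and `most_similar` functions are hypothetical stand-ins. In a real graph database you would call the vendor's built-in similarity function instead of writing your own.

```python
# Hypothetical bipartite graph: customer -> set of products purchased.
PURCHASES = {
    "ann": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove", "kayak"},
    "cat": {"laptop", "monitor"},
}

def jaccard(a, b):
    """Jaccard similarity of two neighbor sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(graph, customer):
    """Return (other_customer, score) with the highest Jaccard similarity."""
    others = (c for c in graph if c != customer)
    return max(((c, jaccard(graph[customer], graph[c])) for c in others),
               key=lambda pair: pair[1])

# Write the result back as a vertex property so later queries can reuse it.
properties = {c: {} for c in PURCHASES}
for c in PURCHASES:
    properties[c]["most_similar"] = most_similar(PURCHASES, c)
```

Storing the answer on the vertex is the point of the last sentence above: the next query reads a property instead of recomputing the similarity.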

We are also seeing algorithms from the deep learning community merge with this rich and evolving library of graph algorithms. For example, the image processing world has shown that convolutional neural network (CNN) algorithms work very well on Euclidean data like images, where the distance between points is uniform across the image. We are also seeing that convolutional approaches work on non-Euclidean data, where the distance between customers is non-uniform when we look at multidimensional similarity scores. Graph Convolutional Neural Networks (GCNs) leverage the structure of the graph to find deep insights even with small training sets of a few hundred training examples. This means we don’t need expensive GPU clusters to train our predictive models. In data mining, structure is the new gold.
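For readers who want to see how a GCN uses graph structure, here is a minimal sketch of a single graph convolution layer in plain Python. It follows the widely used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W), where A is the adjacency matrix, H holds the vertex features and W holds the weights. The `gcn_layer` function, the three-vertex graph and the fixed weights are assumptions for illustration. A real model would learn W during training and would use an optimized library rather than nested lists.

```python
import math

def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 . H . W)."""
    n = len(adj)
    # Add self-loops so each vertex keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by vertex degree.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbor features, then apply the weights.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n))
            for f in range(len(features[0]))] for i in range(n)]
    out = [[sum(agg[i][f] * weights[f][o] for f in range(len(weights)))
            for o in range(len(weights[0]))] for i in range(n)]
    return [[max(0.0, v) for v in row] for row in out]

# Path graph 0 - 1 - 2, one feature per vertex, and a fixed (untrained) 1x1 weight.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = [[1.0], [0.0], [0.0]]
W = [[1.0]]
H1 = gcn_layer(A, H, W)
```

After one layer the feature that started on vertex 0 has spread to its neighbor, vertex 1, but not yet to vertex 2. Stacking layers lets information flow further along the graph, which is how structure substitutes for large labeled training sets.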

As an example of this powerful use of GCNs, a group of our summer interns led by University of Minnesota student Parker Erickson used GCNs to find patterns with higher precision than we have ever seen. This despite very small labeled training sets. These algorithms really work!

If you are following the graph community, it should be clear that vendors like TigerGraph and the new Cambridge Semantics OpenCypher systems are making scale-out, transaction-safe, distributed LPG graphs a true contender to replace both Data Lakes and Data Hubs. Neo4j may not be far behind. Like many NoSQL systems, these systems can start small and continuously grow and shrink just by adding servers to or removing servers from your graph cluster. Because they are native implementations, they have fantastic performance characteristics. Because they use declarative query languages like GSQL (for TigerGraph) and OpenCypher (for Cambridge Semantics), our developers can express what they want to query without having to think about how they want to query it. By using these languages we leverage a large and growing body of algorithms and we keep our algorithms portable. This means we avoid vendor lock-in and we lower risk.

For those of you who have used RDF/SPARQL in the past and have been hit by constant reification issues that forced you to rewrite your SPARQL code, we have good news for you! Because the newer LPG data models allow anyone to add new properties to relationships at any time, we don’t have the constant burden of refactoring and the query collapse problems that came up with the older SQL and SPARQL systems. Our systems stay agile and existing queries continue to work even though new data is constantly being added to our graphs.
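Here is a small Python sketch of why adding a property to a relationship does not break existing queries in an LPG. The `edges` list and the two query functions are hypothetical and only mimic the data model. In a real system you would write these queries in GSQL or Cypher.

```python
# Hypothetical in-memory LPG: each edge carries its own property dict.
edges = [
    {"from": "ann", "to": "acme", "type": "WORKS_FOR", "props": {}},
    {"from": "bob", "to": "acme", "type": "WORKS_FOR", "props": {}},
]

def employees_of(edges, company):
    """An existing query: who works for the company?"""
    return sorted(e["from"] for e in edges
                  if e["type"] == "WORKS_FOR" and e["to"] == company)

before = employees_of(edges, "acme")

# Later, new data arrives: attach a start year directly to one relationship.
edges[0]["props"]["since"] = 2015

after = employees_of(edges, "acme")  # the old query runs unchanged

def employees_since(edges, company, year):
    """A new query that uses the added edge property where it is present."""
    return sorted(e["from"] for e in edges
                  if e["type"] == "WORKS_FOR" and e["to"] == company
                  and "since" in e["props"] and e["props"]["since"] <= year)
```

In a plain RDF triple model, the same change typically means turning the relationship into its own node (reification) so the start year has something to attach to, and that changes the shape that existing queries must match.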

Towards the Hardware Graph

For those of you in the graph community doing performance tuning, you know that the ability of algorithms to work quickly over large amounts of data depends on how many cores can work in parallel and easily share their query results over a large cluster of servers. You also know that the more data you have, the more cores you need to get consistent response times. So what is really going on in these cores? The answer is that 95% of the work is “pointer jumping”: random memory lookups, small tests on properties and then new lookups. Graph traversal has no need for elaborate floating-point instructions, no need for exotic hardware-level data encryption instructions and no need for complex branch prediction logic. We just need really fast pointer jumping CPUs, thank you! And we need lots of them all working together in parallel.
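To show what “pointer jumping” looks like, here is a minimal Python sketch of a two-hop traversal over a compressed sparse row (CSR) layout. CSR is one common way to store a graph in flat arrays, and using it here is my assumption, not a claim about how any specific graph database stores its data. The arrays and the `two_hop_count` function are made up for illustration.

```python
# Compressed sparse row (CSR) layout: the neighbors of vertex v are
# targets[offsets[v]:offsets[v + 1]]. The tiny graph here is made up.
offsets = [0, 2, 3, 5, 5]                # 4 vertices
targets = [1, 2, 3, 0, 3]                # 5 directed edges
is_active = [True, True, True, False]    # a one-bit vertex property to test

def two_hop_count(start):
    """Count two-hop paths from start that end on an active vertex."""
    count = 0
    for i in range(offsets[start], offsets[start + 1]):    # lookup
        mid = targets[i]                                   # jump
        for j in range(offsets[mid], offsets[mid + 1]):    # lookup
            if is_active[targets[j]]:                      # small property test
                count += 1
    return count
```

Notice what the inner loop does: integer loads, an index calculation, a comparison and an increment. There is no floating-point math anywhere, which is the argument for a much simpler core.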

Unfortunately, when we look at today’s Intel x86 instruction sets, the hardware needs to support 1,503 instructions. And as a former VLSI circuit designer, I can tell you that takes a lot of silicon real estate! What we need is a new generation of processors that only implements the instructions we need for graph traversals. We need Reduced Instruction Set Computers (RISC), not Complex Instruction Set Computers (CISC). ARM processors only need about 50 instructions (and various addressing modes). I estimate that for graph processing we need fewer than 100 instructions, which means that we should be able to put over 1,000 cores on each chip. Companies like Graphcore claim they can put up to 7,000 cores on their chips and they could all be working in parallel executing graph algorithms. Unfortunately, Graphcore is focused on the automotive market, which uses a small scene graph with many updates per second, and they are not currently targeting enterprise-level graph traversal benchmarks. I hope this changes in the future.

Graphics Processing Units (GPUs) are also known for their high level of parallelism. However, real-world benchmarks on GPUs are underwhelming for all but the smallest graph problems. For GPUs to work, you need to convert graphs into matrix-like structures such as adjacency matrices. The bottom line is that doing matrix multiplication is really pretty far from the parallel pointer-hopping problem we see in the big-graph space.
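The gap between the two styles can be sketched in Python. This is a simplified illustration with a dense adjacency matrix and a made-up six-vertex graph; the function names are hypothetical, and real GPU graph libraries generally use sparse matrix formats rather than the dense one shown here.

```python
def to_adjacency_matrix(n, edge_list):
    """Dense n x n matrix: one cell per possible edge, mostly zeros for sparse graphs."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edge_list:
        matrix[src][dst] = 1
    return matrix

def one_hop_matrix(matrix, frontier):
    """One traversal step as a vector-matrix product over all n columns."""
    n = len(matrix)
    return [1 if any(frontier[i] and matrix[i][j] for i in range(n)) else 0
            for j in range(n)]

def one_hop_pointers(adjacency, frontier_vertices):
    """The same step by following edges: touches only the edges that exist."""
    return sorted({dst for v in frontier_vertices for dst in adjacency.get(v, [])})

n = 6
edge_list = [(0, 1), (0, 2), (2, 5)]
adjacency = {0: [1, 2], 2: [5]}
matrix = to_adjacency_matrix(n, edge_list)

via_matrix = one_hop_matrix(matrix, [1, 0, 0, 0, 0, 0])  # start at vertex 0
via_pointers = one_hop_pointers(adjacency, [0])
```

Both calls find vertices 1 and 2, but the matrix has 36 cells for 3 edges. For large, sparse enterprise graphs that ratio gets far worse, which is one way to see why matrix hardware is a poor fit for traversal.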

So who will solve these problems? Organizations like Cray Research and DataVortex have been using custom silicon and FPGAs to accelerate algorithms for years. Our 2014 book, Making Sense of NoSQL, had detailed case studies of these solutions, so they should not be considered novel by well-informed solution architects. But there are a few big problems with these systems today.

  1. They require either C-level library code or the antiquated SPARQL language, which limits the portability of our algorithms.
  2. They are focused on analytics and don’t support ACID transactions, which means that integrating new information may not be reliable when there are multiple concurrent writers.
  3. They are difficult to set up and benchmark for broad use cases when we have many different algorithms that need to scale. As a result, it is difficult to measure the ROI of this hardware.

What we need is a new generation of hardware vendors that know we want to store our algorithms in mainstream high-level graph query languages like Cypher or GSQL. Toward the end of 2020, we expect to start to see the new GQL syntax emerge which will hopefully merge the best from Cypher and GSQL.

I believe that the direction we need to move in is to build a community of AI researchers who are educated in both what graph algorithms can do at scale and how deep neural network learning algorithms can reinforce graph algorithms to build advanced HTAP solutions. We also need to make our hardware vendors understand the needs of the scalable graph algorithm community. We need support for high-level declarative graph languages that perform queries over distributed native graph databases.

Disclaimer: Opinions are my own and may not reflect those of my employer.



Dan McCreary

Distinguished Engineer that loves knowledge graphs, AI, and Systems Thinking. Fan of STEM, microcontrollers, robotics, PKGs, and the AI Racing League.