Enterprise Knowledge Graph 2020

10 trends that will help you develop your enterprise graph strategy

2020 brings an exciting year in graph technologies! Here are ten trends to watch.

Background on the EKG problem space

What I hope to do is a persistent, year-by-year analysis of the hundreds of alternative ways to represent data, not just with graphs. This analysis started back in 2006, when Kurt Cagle badgered me into looking into eXist-db. Since then, I have never assumed that all data must fit a single architecture. To Kurt, I am eternally grateful. Everyone needs a mentor like Kurt to encourage us to look beyond our current worldview.

I also think technical staff should be familiar with a holistic analysis process that goes beyond the simple pros and cons of a single graph product. We need a more in-depth analysis that starts with our enterprise integration strategy and goes all the way down to the chip level.

There are still many fun, smaller graph use-cases that I will be touching on in the coming year. Graph rules engines, workflows, goal graphs, and cognitive tech are good examples of these topics. However, my main focus for the next year is the incredible challenges and opportunities that enterprise-scale knowledge graphs will present to organizations. These graphs can be large and usually must be distributed over dozens of servers that each cost over $50K. I believe that in the next few years, organizations that embark on these EKG projects may see huge returns on investment.

Graph Trend #1: Native Labeled Property Graph (LPG) dominates graph technologies

A Google Trends report comparing interest in Cypher and SPARQL for the prior year. See https://trends.google.com/trends/explore?q=cypher,SPARQL

All of the fastest-growing graph companies support the LPG data model, and they do so with native LPG implementations. The announcement this year by Cambridge Semantics that they will support LPG and Cypher was the final nail in the SPARQL coffin for the Enterprise Knowledge Graph market. There will still be a few niche areas that continue to use SPARQL. These products will survive because they still have good inference engines and data quality standards like SHACL. As LPG graph products and standards mature, SPARQL and RDF in the database will continue their decline in market share. This does not imply that RDF is not an excellent on-the-wire standard for data exchange. The reification patches in RDF* have simply proved “too little, too late” for a CIO to take them seriously.

There have also been attempts to build layered graph products on top of other databases, such as key-value stores or column-family stores. None of these technologies has made an impact in the EKG market, and most benchmarks still show an order-of-magnitude loss in performance when you attempt this. Of course, for small graphs that easily run on a single node and don’t require complex tuning, the ease of use of a layered graph-on-another-database architecture might be acceptable.

Graph Trend #2: The GQL standard picks up momentum

I also had the chance to hear one of the GQL leaders, Keith Hare, speak at the Graphorum conference. Keith has become one of my heroes. He brings decades of experience from the SQL standards world and combines it with a deep understanding of the standards that the LPG community needs. His leadership has kept the standards organization moving in the right direction. Keith and others have spent hundreds of hours a year on standards work without any compensation. Thank you for your work on this important project!

Graph Trend #3: Queen similarity ascends the throne

Of all the diverse graph algorithms, in the context of the EKG, similarity kept coming up over and over. I now consider similarity the “queen algorithm” of the EKG ecosystem. If you want to recommend something to a customer, finding what similar customers also purchased is essential. If you have an error in a log file, you want to compare a graph version of this error to 10M other log file errors. If you have an unusual problem with your desktop, finding similar desktops with similar errors helps the debugging process. No other algorithm comes up as frequently in the EKG space.
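To make the recommendation example concrete, here is a minimal sketch in Python (the customer names, purchase sets, and the jaccard helper are all hypothetical, not taken from any particular product): find the customer whose purchases most resemble the target’s, then suggest what they bought that the target has not.

def jaccard(a, b):
    # Jaccard similarity between two sets of purchased items.
    return len(a & b) / len(a | b) if (a | b) else 0.0

purchases = {
    "alice": {"router", "switch", "cable"},
    "bob": {"router", "cable", "firewall"},
    "carol": {"laptop", "dock", "monitor"},
}

def recommend(target, purchases, top_n=1):
    target_items = purchases[target]
    # Rank every other customer by similarity to the target customer.
    ranked = sorted(
        ((jaccard(target_items, items), name)
         for name, items in purchases.items() if name != target),
        reverse=True,
    )
    # Suggest items that the most similar customers bought and the target has not.
    suggestions = set()
    for _, name in ranked[:top_n]:
        suggestions |= purchases[name] - target_items
    return suggestions

print(recommend("alice", purchases))  # {'firewall'}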

Last year also saw the publication of one of the best books on graph algorithms: Graph Algorithms: Practical Examples in Apache Spark and Neo4j by Mark Needham and Amy E. Hodler. My compliments to both Mark and Amy on this excellent book and for making these algorithms more accessible to the graph community. We need more great writers like Mark and Amy to make graph technologies accessible to high-school students.

Graph Trend #4: Hardware similarity with FPGA becomes a commodity

One vendor, Xilinx, delivered a stunningly low-cost solution. For about $2,000, anyone can implement the cosine similarity algorithm on a Xilinx U50 Field Programmable Gate Array (FPGA) board. It returns a list of the 100 most similar customers in around 25 milliseconds! My congratulations to Dan Eaton, Xilinx’s Senior Manager of Market Development of Accelerated Computing, for recognizing the relevance of the graph market. The brilliant Kumar Deepak should also be recognized for implementing the cosine similarity algorithm using only about 35% of the circuits on the U50 FPGA.

The reason that this is super fast on an FPGA is that, at its core, a similarity calculation is an “embarrassingly parallel” computation. To run it efficiently in any real-time setting, we need parallel computer hardware. Granted, we do need to convert each customer into a 200-element vector of 32-bit integers. The FPGA then compares and ranks the most similar vectors using dot-product calculations in hardware. The vectors for 10 million customers can be cached in the onboard RAM of the FPGA.
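A rough software sketch of that pipeline in Python with NumPy may help. The customer count and the random data below are made-up stand-ins, and a production system would run the ranking step in FPGA hardware rather than on the CPU as shown here.

import numpy as np

rng = np.random.default_rng(42)
NUM_CUSTOMERS, DIM = 100_000, 200  # scaled-down stand-in for ~10M customers

# Each customer is encoded as a 200-element vector of 32-bit integers.
customers = rng.integers(0, 1000, size=(NUM_CUSTOMERS, DIM), dtype=np.int32)
query = rng.integers(0, 1000, size=DIM, dtype=np.int32)

# Cosine similarity reduces to a dot product divided by the vector norms,
# which is exactly the kind of work an FPGA can do massively in parallel.
dots = customers.astype(np.int64) @ query.astype(np.int64)
norms = np.linalg.norm(customers.astype(np.float64), axis=1) * np.linalg.norm(query)
cosine = dots / norms

top_100 = np.argsort(cosine)[-100:][::-1]  # the 100 most similar customers
print(top_100[:5], cosine[top_100[:5]])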

Fast similarity on an FPGA does not mean that all graph algorithms are this easy to parallelize. What it does convince me of is that real-time product recommendation systems around the world, which all begin with similarity calculations, can now be moved from large clusters of expensive CPUs to a single server. Amazon F1 instances with FPGAs are available for about a dollar an hour extra. You heard it here first!

Graph Trend #5: Data Scientists and the deep learning community embrace graph

The trend is that people who have applied machine learning, and specifically deep learning, in areas like image processing and NLP now realize that these algorithms can also be used on native LPG graph data.

Two specific algorithms have started to get high visibility: Graph Convolutional Networks (GCN) and graph2vec. GCN algorithms are variations of the image-recognition algorithms that perform convolutions on images. Here you can think of the word “convolution” as a way of looking at the pixels around a given pixel to transform the image; on a graph, the convolution looks at the vertices around a given vertex instead. Graph2vec algorithms are similar to the word2vec algorithms that we find in NLP. They automatically take very high-dimensionality problems and reduce them down to a smaller set of dimensions, for example from 100,000 dimensions down to 200 dimensions. This dimensionality reduction is critical for making fast comparisons of complex customer data.
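To make the graph-convolution idea concrete, here is a minimal one-layer GCN sketch in Python with NumPy; the tiny four-vertex adjacency matrix and the random features and weights are made up purely for illustration.

import numpy as np

# A: adjacency matrix of a tiny 4-vertex graph (made up for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

X = np.random.rand(4, 8)  # 8 input features per vertex
W = np.random.rand(8, 2)  # learnable weights: 8 features in, 2 out

# Add self-loops and symmetrically normalize: A_hat = D^-1/2 (A + I) D^-1/2
A_loop = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_loop.sum(axis=1)))
A_hat = D_inv_sqrt @ A_loop @ D_inv_sqrt

# One graph convolution: each vertex mixes its neighbors' features, then ReLU.
H = np.maximum(A_hat @ X @ W, 0)
print(H.shape)  # (4, 2): every vertex now has a 2-dimensional embedding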

Graph Trend #6: Graph communities mature

Graph Trend #7: Graph-at-scale brings new challenges

Our second challenge is partitioning these massive graphs into manageable subgraphs for some tasks while still maintaining high queryability across the entire graph.
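One common approach, sketched below with NetworkX (the Les Misérables co-occurrence graph is just a small stand-in for a massive EKG), is to cut the graph along its sparse community boundaries and then count how many edges cross partitions, since those cross edges are what make whole-graph queries expensive.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.les_miserables_graph()  # small stand-in for a massive EKG

# Detect densely connected communities and turn each one into a subgraph.
communities = greedy_modularity_communities(G)
subgraphs = [G.subgraph(c).copy() for c in communities]

# Edges that cross partition boundaries are what make whole-graph queries costly.
cross_edges = [(u, v) for u, v in G.edges()
               if not any(u in c and v in c for c in communities)]
print(len(subgraphs), "partitions,", len(cross_edges), "cross-partition edges")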

Many solution architects are reluctant to merge things like rules, workflows, goals, and metadata directly into the EKG. But not putting everything in one graph violates the prime directive of the EKG: connected data is more valuable than unconnected data. Connecting this data may seem obvious to some people, but you would be surprised at how strong the “isolate data to protect it” instinct is in many solution architects.
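As a toy illustration of the one-connected-graph principle (every node name below is hypothetical), rules and metadata can simply be vertices linked to the data they govern, so a single traversal reaches all of them:

import networkx as nx

ekg = nx.DiGraph()
ekg.add_node("Customer:42", kind="data")
ekg.add_node("Rule:credit-check", kind="rule")
ekg.add_node("Metadata:customer-schema", kind="metadata")

# Rules and metadata are connected to the data they govern, not stored elsewhere.
ekg.add_edge("Rule:credit-check", "Customer:42", rel="APPLIES_TO")
ekg.add_edge("Customer:42", "Metadata:customer-schema", rel="CONFORMS_TO")

# One traversal answers "what rules and schema govern this customer?"
print(list(nx.all_neighbors(ekg, "Customer:42")))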

Graph Trend #8: Graph hardware still in its infancy

Graph Trend #9: Metadata and data quality standards for LPGs will emerge

I am 100% sure that the native LPG community will have access to all these wonderful tools in time. If you are looking to create a startup, you should think about the opportunities that will be available when there are ten vendors that all support GQL. Please contact me if you need inspiration!

Graph Trend #10: Complex Adaptive System (CAS) theory predicts Emergence

Scale laws become interesting when we look at the patterns that can be found as your graphs grow from millions of vertices to billions of vertices and eventually to hundreds of billions of vertices. With each order-of-magnitude increase in data connectivity, there are new ways to see patterns in this data. For example, one of the best predictors of a new link between two vertices is how many links already exist around them.
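That intuition is exactly what classic link-prediction heuristics capture. A small sketch using NetworkX's Jaccard coefficient (the Zachary karate-club graph is just a convenient stand-in) scores unconnected vertex pairs by how many neighbors they already share:

import networkx as nx

G = nx.karate_club_graph()  # small example graph

# Score every non-adjacent pair by the share of neighbors they already have
# in common; pairs surrounded by many existing links score highest.
candidates = nx.jaccard_coefficient(G)  # yields (u, v, score) for non-edges
best = sorted(candidates, key=lambda t: t[2], reverse=True)[:5]
for u, v, score in best:
    print(f"predicted link {u}-{v} with score {score:.2f}")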

I don’t know exactly when CAS theory will begin to be an important part of EKG theory, but I am eager to continue to read more about it, and I look forward to your thoughts.

Conclusion

AI and graph technologies will be the two big drivers that will impact real-time decision support, search, chatbots, NLP, automated customer services, recommendation systems, product management, digital marketing, rules engines, goal setting, workflows, integration, automated schema mapping, data discovery, anomaly detection, security threat analytics, master data management, and next-best-action prediction.

