10 trends that will help you develop your enterprise graph strategy
Background on the EKG problem space
When I started writing blogs around enterprise knowledge graph (EKG) technologies two years ago, I had no idea how quickly graph technology was maturing to handle the incredible scale that EKGs demand. When I started doing my investigation, I knew that graph technology was the fastest-growing segment of the database market. But I didn’t realize how graph technologies were beginning to take center stage in our efforts to create integrated views of our enterprise data. I also didn’t know there was such a strong need in the community of solution architects to understand why these trends are accelerating. I am also grateful to my readers for their feedback and encouragement to continue.
What I hope to do is a persistent year-by-year analysis of the hundreds of alternative ways to represent data, not just graph. This analysis started back in 2006 when Kurt Cagle badgered me into looking into eXist-db. I have never assumed all data must fit a single architecture since then. To Kurt, I am eternally grateful. Everyone needs a mentor like Kurt to encourage us to look beyond our current worldview.
I also think more technical staff should be familiar with a more holistic analysis process that goes beyond simple technical analysis of the pros and cons of a single graph product. We need a more in-depth analysis that starts with our enterprise integration strategy and goes down to the chip level.
There are still many fun smaller graph use-cases that I will be touching on in the coming year. Graph rules engines, workflows, goal graphs, and cognitive tech are good examples of these topics. However, my main focus for the next year is the incredible challenges and opportunities that enterprise-scale knowledge graphs will bring to organizations. These graphs can be large and usually must be distributed over dozens of servers that each cost over $50K. I believe that in the next few years, organizations that embark on these EKG projects may see huge returns on investment.
Graph Trend #1: Native Labeled Property Graph (LPG) dominates graph technologies
One quick look at Google Trends should convince you that native LPG graphs have come to dominate the graph technologies landscape.
All the fastest-growing graph companies support the LPG data model, and they do so with native LPG implementations. The announcement this year by Cambridge Semantics that they will support LPG and Cypher was the final nail in the SPARQL coffin for the Enterprise Knowledge Graph market. There will still be a few niche areas that continue to use SPARQL. These products will continue to exist because they still have good inference engines and data quality standards like SHACL. As LPG graph products and standards mature, SPARQL and RDF in the database will continue their market share decline. This does not imply that RDF is not an excellent on-the-wire standard for data exchange. The reification patches of RDF* have simply proved “too little, too late” for a CIO to take them seriously.
There have also been attempts to build layered graph products on top of other databases such as key-value stores or column-family stores. None of these technologies has made an impact in the EKG market, and most benchmarks still show there is an order-of-magnitude loss in performance when you attempt this. Of course, for small graphs that easily run on a single node and don’t require complex tuning, the ease-of-use of a layered graph-on-other architecture might be okay.
Graph Trend #2: The GQL standard picks up momentum
This is the first year I have been back in the standards area in many years. I was pleasantly surprised to join the GQL conference call and hear that there were almost 30 organizations and individuals that either are participating or plan to join in the GQL standards committee process. Alongside the database companies Neo4j, TigerGraph, Oracle, Faircom, Amazon, and Google, there are also graph user communities from healthcare (Optum) and federal agencies participating in the standards process.
I also had the chance to hear one of the GQL leaders, Keith Hare, speak at the Graphorum conference. Keith has become one of my heroes. He brings decades of experience from the SQL standards areas but combines it with a deep understanding of standards that the LPG community needs. His leadership has kept the standards organization moving in the right direction. Keith and others have spent hundreds of hours a year on standards without any compensation. Thank you for your work on this important project!
Graph Trend #3: Queen similarity ascends the throne
When I began analyzing the graph industry, I became fascinated with the focus on graph algorithms within the graph community. No other NoSQL areas have a focus on the standardization of algorithms for things like search, pathfinding, clustering, centrality, dependency analysis, and similarity. The reason, it turns out, is that implementing a typical algorithm in an RDBMS would highlight the massive performance problems in traversing relationships.
Of all these diverse algorithms, in the context of the EKG, the similarity algorithm kept coming up over and over. I now consider similarity algorithms as the “queen algorithm” of the EKG ecosystem. If you want to recommend something to a customer, finding what similar customers also purchased is essential. If you have an error in a log file, you want to compare a graph version of this error to 10M other log file errors. If you have an unusual problem with your desktop, finding similar desktops with similar errors helps the debugging process. No other algorithm comes up as frequently in the EKG space.
Last year also saw the publication of one of the best books on graph algorithms: Graph Algorithms: Practical Examples in Apache Spark and Neo4j by Mark Needham and Amy E. Hodler. My compliments to both Mark and Amy on this excellent book and for making these algorithms more accessible to the graph community. We need more great writers like Mark and Amy to make graph technologies accessible to high-school students.
Graph Trend #4: Hardware similarity with FPGA becomes a commodity
Back in August, I began a discussion about the need for hardware that is better tuned to the needs of large distributed graph technologies. I created a challenge called the “Customers Like Me” challenge that asked vendors to help us compare any given customer to a database of 10 million customers in real-time. The challenge is how quickly your hardware could return a list of the one hundred most similar customers from a pool of 10 million customers. We had meetings with companies like Intel, Dell-EMC, Hitachi, Nvidia, DataVortex, Cray, and several others. We talked to some of the world’s best hardware architects. Several of these firms proposed hardware costing millions of dollars.
One vendor, Xilinx, delivered a stunningly low-cost solution. For about $2,000, anyone can implement the cosine similarity algorithm on one of Xilinx’s U50 Field Programmable Gate Array (FPGA) boards. It returns a list of the 100 most similar customers in around 25 milliseconds! My congratulations to Dan Eaton, Xilinx’s Senior Manager of Market Development of Accelerated Computing, for recognizing the relevance of the graph market. The brilliant Kumar Deepak also should be recognized for implementing the cosine similarity algorithm using only about 35% of the circuits on the U50 FPGA.
The reason that this is super fast on an FPGA is that at its core, similarity calculations are classified as “embarrassingly parallel” computations. To run efficiently in any real-time setting, we need parallel computer hardware. Granted, we do need to convert your customers to 200-element vectors of 32-bit integers. You then use an FPGA to compare and rank the most similar vectors using dot-product calculations in hardware. The vectors of the 10 million customers can be cached on onboard RAM on the FPGA.
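To make the calculation concrete, here is a minimal software sketch of the same idea in plain Python. It is not the Xilinx implementation, and it runs serially on a CPU rather than in parallel hardware. The function names, the toy pool of 1,000 customers, and the random vector values are all assumptions made for illustration.

```python
import heapq
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length integer vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(target, customers, k=100):
    """Return the k highest (score, customer_id) pairs, best first."""
    scored = ((cosine_similarity(target, vec), cid) for cid, vec in customers.items())
    return heapq.nlargest(k, scored)


# Hypothetical toy data: 1,000 customers instead of 10 million,
# each encoded as a 200-element vector of small integers.
random.seed(42)
DIMENSIONS = 200
customers = {
    cid: [random.randint(0, 9) for _ in range(DIMENSIONS)] for cid in range(1000)
}

target = customers[0]
nearest = top_k_similar(target, customers, k=100)
```

Each score depends only on the target and one other vector, so no comparison has to wait for any other. That independence is what makes the problem embarrassingly parallel and lets hardware compute many dot products at once.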
Fast similarity on FPGA does not mean that all graph algorithms are this easy to parallelize. What it does convince me of is that real-time product recommendation systems around the world, which all begin with similarity calculations, can now be converted from using large numbers of expensive CPUs to a single server. Amazon F1 instances with FPGAs are available for about a dollar an hour extra. You heard it here first!
Graph Trend #5: Data Scientists and the deep learning community embrace graph
This year we have started to see a larger number of papers that are bridging the gap between two of the great tribes in artificial intelligence: the machine learning network tribe and the symbolic reasoning tribe. Single-node knowledge graph people often tend to come from the symbolic reasoning tribe. Their work on RDF, OWL, and semantic web reasoning back in the early 2000s separated their knowledge from the knowledge of machine learning people that evolved from the statistics (SAS and R) communities.
The trend is that people that have applied machine learning and specifically deep learning algorithms in areas like image processing and NLP now realize that these algorithms can also be used on native LPG graph data.
Two specific algorithms have started to get high visibility: Graph Convolutional Networks (GCN) and graph2vec. GCN algorithms are variations of image recognition algorithms that perform convolutions on images. Here you can think of the word “convolution” as a way of looking at the pixels around a given pixel to do a transformation of the image. Graph2vec algorithms are similar to the word2vec algorithms that we find in NLP. They automatically take very high dimensionality problems and reduce them down to a smaller set of dimensions — for example, from 100,000 dimensions down to 200 dimensions. This is something that is critical for us to make a fast comparison of complex customer data.
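The convolution analogy can be sketched in a few lines of plain Python. This is a deliberately simplified single layer that averages each vertex's features with those of its neighbors, applies a weight matrix, and then a ReLU. Published GCN variants use other normalizations and learn the weights during training. The function name, the tiny graph, and the fixed weights are assumptions for illustration only.

```python
def gcn_layer(adjacency, features, weights):
    """One simplified graph-convolution step using mean aggregation.

    adjacency: dict mapping each vertex to a list of neighbor vertices
    features:  dict mapping each vertex to its feature vector (list of floats)
    weights:   matrix (list of rows) mapping input features to output features
    """
    output = {}
    for vertex, neighbors in adjacency.items():
        # Include the vertex itself, just as a pixel is part of its own window.
        neighborhood = [vertex] + list(neighbors)
        size = len(features[vertex])
        # Average the feature vectors across the neighborhood.
        mean = [
            sum(features[n][i] for n in neighborhood) / len(neighborhood)
            for i in range(size)
        ]
        # Linear transform followed by a ReLU non-linearity.
        transformed = [
            sum(mean[i] * weights[i][j] for i in range(size))
            for j in range(len(weights[0]))
        ]
        output[vertex] = [max(0.0, value) for value in transformed]
    return output


# Hypothetical three-vertex graph with two input features per vertex.
adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 1.0]}
weights = [[1.0], [0.5]]  # projects two features down to one

embedding = gcn_layer(adjacency, features, weights)
```

Stacking several such layers lets information flow from vertices several hops away, much as stacked image convolutions widen the area of the picture each output pixel can see.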
Graph Trend #6: Graph communities mature
In the past, we have seen inconsistent conferences on graph technologies. Except for Neo4j’s GraphConnect conference, few graph conferences drew large crowds. This all changed with the Graphorum conference in Chicago in October. Graphorum was the first graph conference I attended where there was a wide variety of vendors and customers using different technologies that were exchanging great ideas and getting ready for the emergence of the GQL standards. My thanks to Tony Shaw and his team at Dataversity for making this conference a success. We need more vendor-neutral conferences like this!
Graph Trend #7: Graph-at-scale brings new challenges
One of my key observations of the last two years is that there are many hidden benefits to EKGs, but also many challenges. Distributed graphs have a unique set of concerns, such as how you distribute queries out to a cluster, store local results within each node, and then collate the results. These are the same problems we had with Map-Reduce queries in Hadoop (RIP). TigerGraph has introduced the concept of Accumulators into the GSQL language. They intend to allow the GQL standards to leverage these concepts. We hope that some future version of GQL does benefit from the work TigerGraph has done. Our developers love Accumulators. They make queries much smaller, which makes them both easier to read and easier to debug.
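The distribute, accumulate locally, then collate pattern described above can be sketched in plain Python. This is not GSQL and it does not reproduce TigerGraph's Accumulator syntax. It only illustrates the shape of the computation, and the function names and the three-partition edge list are hypothetical.

```python
from collections import Counter


def local_degree_counts(partition_edges):
    """Runs on one node: accumulate out-degree per vertex for local edges only."""
    counts = Counter()
    for source, _target in partition_edges:
        counts[source] += 1
    return counts


def collate(partial_results):
    """Runs on the coordinator: merge the per-partition accumulators."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical cluster of three partitions, each holding a slice of the edges.
partitions = [
    [("alice", "bob"), ("alice", "carol")],
    [("bob", "carol"), ("alice", "dave")],
    [("carol", "dave")],
]

partials = [local_degree_counts(edges) for edges in partitions]
out_degree = collate(partials)
```

Because addition is associative and commutative, the partial counts can be merged in any order, which is what makes this style of aggregation safe to spread across a cluster.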
Our second challenge is partitioning these massive graphs into manageable subgraphs for some tasks while still maintaining high queryability across the entire graph.
Many solution architects are reluctant to merge things like rules, workflows, goals, and metadata directly into the EKG. But not putting everything in one graph violates the prime directive of EKG: connected data is more valuable than non-connected data. Connecting this may seem obvious to some people, but you would be surprised at how strong the “isolate data to protect it” instinct is in many solution architects.
Graph Trend #8: Graph hardware still in its infancy
For the past year, I had been hopeful that companies like Graphcore would provide general-purpose but extensible hardware for the efficient implementation of graph algorithms. Other hardware manufacturers tried to suggest that a future standard called Gen-Z would make it easier to extend a single node or cluster to include more CPU and memory hardware without disrupting services. And I am confident that Gen-Z will work in the future. For the next year, look to FPGAs to solve many of the graph algorithms that need to execute in parallel. The next generation of FPGA will contain over 30 billion transistors that can be rewired for specific CPU-intensive algorithms in under two seconds. There are a LOT of time-intensive graph algorithms that can be attacked using the FPGA approach.
Graph Trend #9: Metadata and data quality standards for LPGs will emerge
One of the great features of mature RDBMS systems is that they have a mature market for tools that manage metadata. Many data stewardship tasks, data provenance analysis, and data quality issues use sophisticated software packages. You can do things like understand how many reports are impacted if you change a column name. I always look with envy at Stardog’s native support for data quality standards like SHACL.
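The column-rename question is itself a graph traversal, which is one reason this kind of tooling seems a natural fit for LPGs. Below is a minimal sketch, assuming a hypothetical dependency map in which each key lists the views and reports built directly on top of it. The naming scheme with "column:", "view:", and "report:" prefixes is invented for this example.

```python
from collections import deque


def impacted_reports(dependents, changed_column):
    """Breadth-first walk of everything downstream of a column; return the reports reached."""
    seen = {changed_column}
    queue = deque([changed_column])
    reports = set()
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream in seen:
                continue  # already visited through another path
            seen.add(downstream)
            queue.append(downstream)
            if downstream.startswith("report:"):
                reports.add(downstream)
    return reports


# Hypothetical lineage metadata: key -> objects that read directly from it.
dependents = {
    "column:customer.zip": ["view:customer_region", "report:mailing_list"],
    "view:customer_region": ["report:regional_sales", "report:churn_by_region"],
    "column:customer.name": ["report:mailing_list"],
}

zip_impact = impacted_reports(dependents, "column:customer.zip")
name_impact = impacted_reports(dependents, "column:customer.name")
```

Here renaming the zip column touches three reports, two of them only indirectly through a view, while renaming the name column touches one.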
I am 100% sure that the native LPG community will have access to all these wonderful tools in time. If you are looking to create a startup, you should think about the opportunities that will be available when there are ten vendors that all support GQL. Please contact me if you need inspiration!
Graph Trend #10: Complex Adaptive System (CAS) theory predicts Emergence
My friend Arun Batchu and I co-presented at the Medfuse conference this year on the fascinating topic of Complex Adaptive Systems (CAS). Before Arun pushed me to read about CAS theory, I had no formal education in predicting the benefit of connected systems. CAS theory does not outright make these predictions, but it creates a formal methodology to help you do some of this analysis. CAS is a hybrid of several technologies, including graph and network theory, chaos theory, evolution, genetic programming, and scale laws.
Scale laws are interesting when we look at the patterns that can be found as your graphs grow from millions of vertices to billions of vertices and eventually to hundreds of billions of vertices. With each order-of-magnitude increase in data connectivity, there are new ways to see patterns in this data. One of the best ways to predict a new link is to have lots of other links around.
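The last point, that surrounding links help predict new ones, can be illustrated with the common-neighbors heuristic from link prediction. This is one simple scoring method among many and is not part of CAS theory itself. The function name and the five-person graph are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(adjacency):
    """Score every unlinked vertex pair by the number of neighbors they share."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        shared = adjacency[u] & adjacency[v]
        if shared:
            scores[(u, v)] = len(shared)
    return scores


# Hypothetical undirected graph stored as vertex -> set of neighbors.
adjacency = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "eve"},
    "dan": {"ann", "eve"},
    "eve": {"cat", "dan"},
}

scores = common_neighbor_scores(adjacency)
best_score = max(scores.values())
likely_links = sorted(pair for pair, score in scores.items() if score == best_score)
```

In this toy graph the pairs ann and eve, and cat and dan, each share two neighbors and so rank as the most likely new links. As a graph gets denser, more pairs accumulate this kind of shared-neighbor evidence.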
I don’t know exactly when CAS theory will begin to be an important part of EKG theory, but I am eager to continue to read more about it, and I look forward to your thoughts.
You can see that although I am very optimistic about the future of graph technologies, I am somewhat cautious about picking specific dates when future events might occur. I always try to keep the Paul Saffo quote in mind: “Never mistake a clear view for a short distance.” I would love to predict that the grammar for GQL will be done by this time next year and that two years from now, we will have ten vendors that all implement GQL standards. But there are just too many unknowns. Instead, what I will say is that if you are doing strategic planning for your company, make sure you have someone familiar with both AI and graph technology at the table.
AI and graph technologies will be the two big drivers that will impact real-time decision support, search, chatbots, NLP, automated customer services, recommendation systems, product management, digital marketing, rules engines, goal setting, workflows, integration, automated schema mapping, data discovery, anomaly detection, security threat analytics, master data management, and next-best-action prediction.