Looking Forward to 2019 in Graph Technologies
I started out 2018 with a new focus on graph technologies. This year I will continue to look at graph technologies and how they are disrupting the status quo. This report is an analysis of some of the key graph database trends in 2018 and my predictions for 2019.
I should also put in a quick disclaimer. My employer (Optum Technologies) has a policy of not endorsing any vendors. This blog is only my personal observation of trends in the graph industry and is not an endorsement of any organization or product.
2018 — The year of the LPG knowledge graph
Last year we saw a dramatic uptake in knowledge representation in the form of the labeled property graph (LPG). Organizations like Neo4j and TigerGraph grew their market share in this area and added a large number of new capabilities. New venture funding flowed into firms that either build on LPGs or are quickly migrating to support them. TigerGraph is noted for its native distributed graphs, and Neo4j is noted for its innovation with products like Bloom.
TigerGraph has gained widespread attention for its combination of scalability and security. In September TigerGraph won the Most Disruptive Startup Award at the Strata conference. The key fact here is that distributed graph processing is hard. How a vendor partitions a large, highly interconnected graph database, and how they keep queries performant while maintaining ACID compliance over a distributed cluster, is a wicked hard problem. TigerGraph has also introduced innovations such as accumulators in its GSQL language that have attracted the interest of developers.
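To make the accumulator idea more concrete, here is a rough Python analogy (not GSQL, and not any TigerGraph API): as a traversal touches each edge, values are summed into a per-vertex accumulator, and the aggregated results come back in a single pass. Everything in this sketch is invented for illustration.

```python
# Hypothetical sketch: a Python analogy of a GSQL-style per-vertex accumulator.
# In GSQL, accumulators aggregate values as a traversal visits vertices;
# here a plain dictionary plays that role.
from collections import defaultdict

# Toy adjacency list: person -> list of (friend, purchase_amount) edges
graph = {
    "alice": [("bob", 20), ("carol", 35)],
    "bob":   [("carol", 10)],
    "carol": [],
}

def total_inbound_spend(graph):
    """One traversal pass that sums edge weights into a per-vertex accumulator."""
    acc = defaultdict(float)          # the "accumulator", one slot per vertex
    for source, edges in graph.items():
        for target, amount in edges:
            acc[target] += amount     # accumulate as the traversal touches each edge
    return dict(acc)

print(total_inbound_spend(graph))     # {'bob': 20.0, 'carol': 45.0}
```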
Neo4j’s Bloom product, although expensive on a per-user basis, is of interest because it is starting to blur the lines between graph visualization and natural language query processing. Many Bloom queries minimize the need to know Cypher and focus on allowing non-technical people to do complex analysis of graph databases. Neo4j’s Bloom, along with vendors like Cambridge Intelligence (KeyLines) and Linkurious, will continue to make graphs easier to query for the non-programmer.
Neo4j continues to break new ground in making it easier for new developers to pick up graph technologies. In 2018 they made significant enhancements to their Neo4j Desktop as well as providing extensive documentation on their graph algorithms. My congratulations to both Mark Needham and Amy Hodler on writing their book A Comprehensive Guide to Graph Algorithms in Neo4j. We need more high-quality writing like this.
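As a taste of the kind of “standard” algorithms that material covers, here is a minimal sketch using the networkx Python library (not Neo4j itself; in Neo4j the equivalents ship as library procedures callable from Cypher). The tiny graph and node names are made up for the example.

```python
# Minimal sketch: running two standard graph algorithms with networkx.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),   # a tight triangle
    ("carol", "dave"), ("dave", "erin"),                      # a chain hanging off it
])

# Centrality: which nodes are most "important" by link structure?
ranks = nx.pagerank(G)
print(sorted(ranks.items(), key=lambda kv: -kv[1]))

# Community detection: which nodes cluster together?
communities = nx.algorithms.community.greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```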
The Open Algorithms Movement
It is interesting to note that despite organizations patenting many graph algorithms, there seems to be a growing trend to make these algorithms public. Google and Facebook are leading this effort with their promise to their internal researchers that they can publish their work on various AI, ML, deep learning and graph algorithms. I think this is something they are forced to do to hire and retain top-shelf AI and knowledge representation talent in their research and development divisions.
One prediction for graph product managers: in the future, unless your graph database can run thousands of standard graph algorithms, you will be at a disadvantage. This leads to the question of whether there will be a standard for LPGs so that innovative graph algorithms can have a wider impact. That question, along with the question of whether paths should be treated as first-class citizens, will be addressed at the W3C Workshop on Web Standardization for Graph Data on March 4th-6th in Berlin, Germany. If you read between the lines, you see the W3C is clearly aware that LPGs have a large majority of the market and that SPARQL-based standards may no longer be relevant. Let’s all wish the W3C “good luck” so that we can continue to have shared standardized repositories of knowledge and standard algorithms to traverse these repositories. Note that I am not saying that RDF is not important here. On-the-wire semantic standards and things like namespaces and URIs are going to increase in importance as we get more data. It is the query languages that encode these algorithms that have innovated faster than the standards bodies can keep up.
Overlay Graph Products Stagnate
I should also mention that we have not seen corresponding growth in “overlay” graph technologies. These are graph databases that run on top of other distributed column-family databases like Cassandra or key-value stores like Redis. My observation is that there is just too large a performance gap between native property graphs and these layered solutions. These challenges are not evident when graphs are small, as we have seen in several pilot and proof-of-concept projects. But scaling overlay graphs from pilot to production with larger datasets has imposed a large financial burden in hardware and network traffic. Several benchmarks published this year also made this clear.
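A back-of-the-envelope way to see why the gap appears: in a native graph each hop is roughly an in-memory pointer dereference, while an overlay must issue a lookup against the backing store (often over the network) for each hop, so multi-hop traversals multiply that per-hop cost. The latencies below are illustrative assumptions, not measurements from any benchmark.

```python
# Illustrative only: a rough cost model for a multi-hop traversal.
# Assumed per-hop latencies (made-up, order-of-magnitude placeholders):
NATIVE_HOP_SECONDS  = 50e-9    # in-memory pointer chase
OVERLAY_HOP_SECONDS = 500e-6   # round trip to a backing key-value / column store

def traversal_cost(edges_touched, per_hop_seconds):
    """Total time if every edge touched pays the per-hop latency."""
    return edges_touched * per_hop_seconds

# A 3-hop query that fans out to roughly 100,000 edges
edges = 100_000
print(f"native : {traversal_cost(edges, NATIVE_HOP_SECONDS):.4f} s")
print(f"overlay: {traversal_cost(edges, OVERLAY_HOP_SECONDS):.1f} s")
```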
Graphs and AI
One of the most significant developments of 2018 was the publication of an influential position paper by Google DeepMind researchers and others from academia. This paper, titled Relational inductive biases, deep learning, and graph networks, has already been cited by hundreds of other papers and publications. At its core, it says that when you flatten real-world data that has structure into tabular features for input to deep learning algorithms, you lose a great deal of contextual information. Storing knowledge in a graph and allowing that graph to interact with your learned rules offers many advantages.
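A minimal sketch of the idea: instead of flattening entities into independent rows, keep the relationships and let each node update its representation from its neighbors (one round of “message passing”). This uses plain numpy, and the feature values and weights are invented purely for illustration, not taken from the paper.

```python
# Minimal sketch of one message-passing step over a graph (numpy only).
# Flattening this data into a table would discard the edge structure
# that drives the update below.
import numpy as np

# 4 nodes, each with a 2-dimensional feature vector (made-up values)
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [0.5, 0.5]])

# Adjacency matrix: who is connected to whom
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

W = np.random.randn(2, 2) * 0.1           # stand-in for a learned weight matrix

# One round: aggregate neighbor features, transform, apply a non-linearity
messages = A @ features                   # sum of each node's neighbor features
updated  = np.tanh(messages @ W)          # new node representations
print(updated)
```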
Although many people are enthusiastic about the ability of GPUs to quickly do matrix math, the real world does not work this way. Our brains don’t have any circuits for matrix multiplication, and yet we reason far better than AI systems that use them. The key insight is that both GPUs and our brains need to quickly do parallel non-linear transformations of data to perform pattern recognition. GPUs just happen to be the most parallel devices around. I try to tell everyone around me that there is no clear binary division between graph-based rules engines and the inference rules generated by deep-learning algorithms. Deep-learning rules are just larger and harder to explain. In order to have explainable AI, we need to bring graph rules engines together with machine-learning systems. Vendors that do this well will have a distinct advantage.
Graphs and Entity Resolution Rules
One use case for complex graph rules is determining whether two entities (People, Organizations, Products, Providers, Customers, etc.) are the same real-world thing or different things. These questions are all enabled because graphs are very good at doing fast similarity calculations. The rules for these calculations can also be tuned using machine learning. Then, after we have determined that two data sources describe the same entity, we need to intelligently merge their data. These are universal problems, and they are ideally suited to graph technologies. In 2019, look for graph vendors to either build these solutions into their products or have partners that can provide a solution. One vendor to watch in the entity resolution space is FactGem. They have a mature and robust framework for doing entity resolution on graph databases. Companies like Reltio are also using graph technologies in their cloud-based entity resolution services. Both in-graph entity resolution and entity resolution as a service will see growth in 2019.
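To show the kind of similarity calculation involved, here is a minimal Python sketch, not any vendor's implementation: it scores two person records with Jaccard similarity over their attribute tokens and merges them when the score clears a threshold. The field names, records, and threshold are assumptions for illustration; in practice the threshold and field weights are what machine learning would tune.

```python
# Hypothetical sketch: pairwise entity similarity plus a simple merge rule.
def tokens(record):
    """Flatten a record's values into a set of lowercase tokens."""
    return {t.lower() for value in record.values() for t in str(value).split()}

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def resolve(rec1, rec2, threshold=0.6):    # threshold would be tuned, e.g. by ML
    score = jaccard(tokens(rec1), tokens(rec2))
    if score >= threshold:
        merged = {**rec1, **rec2}          # naive merge; real systems pick per-field winners
        return score, merged
    return score, None

a = {"name": "Jon Smith",  "city": "Boston", "phone": "555-0100"}
b = {"name": "John Smith", "city": "Boston", "phone": "555-0100"}
print(resolve(a, b))
```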
Graphs and Corporate Social Networks
One of the most important papers of the year in graph technologies came not from technical journals but from the Harvard Business Review. In their Nov-Dec 2018 article Better People Analytics, Paul Leonardi and Noshir Contractor reviewed six key “signatures” (graph patterns) that organizations can use to look for things like innovation, influence, collaboration, efficiency and dysfunction in individuals, teams and organizations. These algorithms can run alongside the standard skill-and-interest searches in large enterprises, answering questions like “who should be on this project?” where experience, skills and interests drive staffing decisions. Look for increased use of graph technologies to make organizations run more efficiently in 2019.
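For readers who want to experiment, here is a small networkx sketch of one signature in this spirit (my choice of algorithm, not necessarily one of the article’s six): betweenness centrality as a rough proxy for people who broker information between otherwise separate groups. The collaboration edges are made up for the example.

```python
# Minimal sketch: spotting "brokers" in a collaboration network with networkx.
# High betweenness centrality flags people who sit on many shortest paths
# between otherwise separate groups.
import networkx as nx

collab = nx.Graph()
collab.add_edges_from([
    ("ann", "bo"), ("bo", "cy"), ("ann", "cy"),        # team 1
    ("dee", "ed"), ("ed", "fay"), ("dee", "fay"),      # team 2
    ("cy", "gil"), ("gil", "dee"),                     # gil bridges the two teams
])

brokers = nx.betweenness_centrality(collab)
for person, score in sorted(brokers.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{person}: {score:.2f}")
```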
Open Knowledge Graphs
I was also interested to see many organizations begin to see the need for open, shared knowledge graphs. People who have been following the Semantic Web since 2001 know that this is not a new concept. What is interesting is how organizations trying to promote AI technologies like speech recognition need shared knowledge graphs to compete with Google and Amazon. For example, the company SoundHound has been promoting a platform approach to speech recognition in which different organizations can publish both public and extensible knowledge graphs that are required for high-quality speech recognition. Consider asking your cell phone: “How much does it cost to go from the nearest airport to the best Italian restaurant in San Francisco that has more than 4 stars, is good for kids, is not a chain and is open after 9pm on Wednesdays, and how long is the trip?” The key is that knowledge graphs for locations (airports), facilities (restaurants), reviews and roads all need to be in a single graph for this query to run in real time. The SoundHound speech-to-text system is really just converting natural language speech into a graph query on a (hopefully public) knowledge graph. But without this graph, their better speech recognition has limited usefulness.
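To illustrate why all of those knowledge graphs need to live in one queryable graph, here is a toy networkx sketch that answers a stripped-down version of that question: filter restaurants by the spoken constraints, then route from the airport over the same graph. Every node, attribute, and travel time below is invented for the example; this is not how SoundHound works internally.

```python
# Toy sketch: one combined knowledge graph of places, reviews, and roads.
import networkx as nx

kg = nx.Graph()
kg.add_node("SFO", kind="airport")
kg.add_node("Trattoria Roma", kind="restaurant", cuisine="italian",
            stars=4.5, kid_friendly=True, chain=False, open_late=True)
kg.add_node("Pasta Chain Co", kind="restaurant", cuisine="italian",
            stars=4.2, kid_friendly=True, chain=True, open_late=True)
# Road edges with travel minutes
kg.add_edge("SFO", "Downtown", minutes=25)
kg.add_edge("Downtown", "Trattoria Roma", minutes=10)
kg.add_edge("Downtown", "Pasta Chain Co", minutes=5)

# Filter restaurants by the spoken constraints...
candidates = [n for n, d in kg.nodes(data=True)
              if d.get("kind") == "restaurant" and d.get("cuisine") == "italian"
              and d.get("stars", 0) > 4 and d.get("kid_friendly")
              and not d.get("chain") and d.get("open_late")]

# ...then route from the airport over the very same graph.
for place in candidates:
    minutes = nx.shortest_path_length(kg, "SFO", place, weight="minutes")
    print(f"{place}: about {minutes} minutes from SFO")
```

If the location data, review data, and road data lived in three separate silos, this single pass over one graph would instead become several cross-system joins, which is exactly the real-time problem the platform approach is trying to avoid.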
Custom Graph Hardware, FPGAs and the Future of Graphs
If you look at the instruction sets that most graph databases exercise today, you see lots of fast pointer-hopping over large address spaces, but hardly any floating-point or vector operations. I estimate that most graph queries use around 20% of the circuits on a typical CPU, and there are no graph vendors that can efficiently use GPU hardware. On the other hand, there are innovative companies like Cray Research and DataVortex that can demonstrate fantastic graph performance by using Field Programmable Gate Arrays (FPGAs) and by tweaking the way that memory systems are accessed. These systems give the programmer the feeling that they are accessing 100TB of RAM, even though the memory is spread across different systems. Unfortunately, this incredible million-fold increase in graph processing power is currently only accessible to older SPARQL queries (on the Cray) or C-level graph libraries (on DataVortex). My hope is that we will be able to run ALL of our LPG graph algorithms on these devices in the future. We know that graph queries can scale; we just need to make it easier to use these advanced hardware-driven solutions.
Graph Writers Wanted
I also want to encourage the members of our graph community who continue to write about graphs and the intersection of graphs with other technologies. We need more vendor-neutral, LPG-centric articles and books on graph modeling, graph algorithms, graph visualization, knowledge graphs, machine learning, NLP, rules engines, recommendation systems and entity resolution. I am especially interested in following the new Manning book Graph Powered Machine Learning that Alessandro Negro is working on. As of today he has the first three chapters in draft on the Manning Early Access Program (MEAP) web site, and I am sure he would appreciate more reviewers. Remember that our decision makers usually make decisions based not on a deep understanding of how technology works, but on the metaphors they have been taught. We need more metaphors that work in the right context. Please let me know if I can help encourage your writing and review your work.
In summary, I think that 2019 could be a key inflection point for graph technologies. AI can’t make progress without strong knowledge representation. Distributed native graph databases with strong access control hold a great deal of promise, but it is still difficult to predict when a mature ecosystem of graph database add-ons will provide complete solutions. I try to remember the Paul Saffo quote: “Never confuse a clear view with a short distance.” It is clear to me that distributed, secure native LPGs are going to dominate the database market and replace not just SPARQL but many relational systems. I just can’t tell you how quickly this will happen.
Happy New Year everyone!