For the last year I have been slowly coming to the realization that, to stay competitive, almost all organizations are going to need to build large internal knowledge graphs (note the lowercase “g”). If you are not familiar with the definition of a knowledge graph, you can check out Jo Stichbury’s excellent summary. Note that the term “Knowledge Graph” (big “G”) is associated with the data behind Google’s info boxes, while “Knowledge graph” (little “g”) redirects to Ontology on Wikipedia.
Now let us extend the “knowledge graph” definition to include a large, integrated, enterprise-wide graph created by an organization to serve as its centralized source of integrated knowledge and inference. We will call this an “Enterprise Knowledge Graph”, or EKG. I like the abbreviation because it also evokes the heartbeat of an organization’s knowledge framework.
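To make the idea concrete, here is a minimal sketch of what “integrated knowledge and inference” can mean in practice: facts stored as subject-predicate-object triples, plus one inference rule that derives new facts until no more can be added. The entity and predicate names are purely hypothetical.

```python
# Minimal sketch: a knowledge graph as subject-predicate-object triples,
# plus one inference rule (transitivity of "is_a").
# All entity and predicate names are hypothetical.

triples = {
    ("AcmeCorp", "is_a", "Customer"),
    ("Customer", "is_a", "LegalEntity"),
    ("AcmeCorp", "located_in", "Ohio"),
}

def infer_transitive(facts, predicate):
    """Repeatedly apply (a p b) and (b p c) => (a p c) until fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(facts):
            for (b2, p2, c) in list(facts):
                if p1 == p2 == predicate and b == b2:
                    if (a, predicate, c) not in facts:
                        facts.add((a, predicate, c))
                        changed = True
    return facts

enriched = infer_transitive(triples, "is_a")
# The derived fact: AcmeCorp is_a LegalEntity
assert ("AcmeCorp", "is_a", "LegalEntity") in enriched
```

The point of the sketch is that the “inference” half of the definition is just as important as the storage half: new knowledge falls out of rules applied to integrated facts.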
My conviction that organizations need an EKG comes from many years of working with large organizations that have hundreds or thousands of isolated “data silos”. They often try to integrate data into a “data warehouse” for historical data analytics, but these systems also tend to impose a huge cost burden: yet another copy of the data, yet another set of expensive ETL projects that need to be maintained, and yet another copy of data that needs to be secured and audited.
Many of my early years were spent first building relational systems and then adding OLAP systems that use a “Kimball” star schema for storing historical facts. These systems had some strong use cases, but they were brittle and difficult to modify as requirements changed.
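For readers who have not met one, a Kimball star schema centers a fact table on a ring of dimension tables. A minimal sketch using SQLite, with hypothetical table and column names:

```python
import sqlite3

# Sketch of a Kimball-style star schema: one fact table joined to
# dimension tables. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INT, month INT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (date_id INT REFERENCES dim_date,
                          product_id INT REFERENCES dim_product,
                          amount REAL);
INSERT INTO dim_date VALUES (1, 2017, 12), (2, 2018, 1);
INSERT INTO dim_product VALUES (10, 'widget');
INSERT INTO fact_sales VALUES (1, 10, 99.0), (2, 10, 120.0);
""")

# A typical analytic query: total sales by year.
rows = conn.execute("""
    SELECT d.year, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d USING (date_id)
    GROUP BY d.year ORDER BY d.year
""").fetchall()
print(rows)  # [(2017, 99.0), (2018, 120.0)]
```

The brittleness shows up when requirements change: adding a new dimension or changing grain means altering the fact table and reworking every ETL job that feeds it.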
Around 2007 we also started to explore the then-new concepts of “Data Lakes” and Hadoop. These were wonderful adventures away from the tyranny of single processors and helped us understand the benefits of distributed computing and bulk data transformations at scale. By “scale” we mean transforming terabytes of data using hundreds of processors working together.
For several years I also had exposure to the concept of a “Data Hub”, where data was stored in XML and JSON documents and linked together to provide an integrated or “360” view of the customer. However, although these systems were orders of magnitude more flexible than punchcard-inspired tables of data, I still saw a majority of budgets being spent on manually extracting data from legacy COBOL and relational systems. I kept asking myself: “How could we automate this process?”
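The “360 view” idea can be sketched as documents from separate source systems linked by a shared key. The source systems and fields below are hypothetical:

```python
# Sketch: a "data hub" linking documents from separate source systems
# into a single 360-degree customer view via a shared key.
# Source names and fields are hypothetical.
crm_docs     = [{"customer_id": "C042", "name": "Acme Corp", "source": "crm"}]
billing_docs = [{"customer_id": "C042", "open_invoices": 2, "source": "billing"}]
support_docs = [{"customer_id": "C042", "open_tickets": 1, "source": "support"}]

def customer_360(customer_id, *doc_collections):
    """Merge every document that shares the linking key into one view."""
    view = {"customer_id": customer_id, "sources": []}
    for docs in doc_collections:
        for doc in docs:
            if doc.get("customer_id") == customer_id:
                view["sources"].append(doc["source"])
                view.update({k: v for k, v in doc.items()
                             if k not in ("customer_id", "source")})
    return view

view = customer_360("C042", crm_docs, billing_docs, support_docs)
print(view["name"], view["open_invoices"], view["open_tickets"])
```

The flexibility is real: a new source system is just another document collection. The cost, as noted above, is that someone still has to extract those documents from the legacy systems by hand.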
One of the executives at a company I was working for asked me if I could use ontologies to speed the integration of data. I told him we could, and I set off on a skunkworks project to demonstrate this capability. We found four distinct classes of “signals” in data that could help us speed the integration process. Each of these signals could be fed into a logistic regression or deep learning system to classify data. We showed that from this data we could dramatically speed the RDBMS-to-document mapping process. Unfortunately, due to transitions in my employment, I could not continue the project. But it proved to me that many manual ETL processes could be automated if we had sufficient training data. I also realized that ontologies and inference were key to accelerating the mapping process.
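I can’t share the details of that project, so here is only a rough sketch of the general approach, with hypothetical stand-in features (name similarity, type match, value overlap, length ratio — not the actual four signal classes) scored by a plain-Python logistic regression:

```python
import math

# Sketch: scoring candidate column-to-field mappings with logistic
# regression. The feature names below are hypothetical stand-ins,
# not the actual signal classes from the project described above.

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, labels, lr=0.5, epochs=2000):
    """Plain-Python logistic regression trained by gradient descent."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Each row: [name_similarity, type_match, value_overlap, length_ratio]
X = [[0.9, 1, 0.8, 0.95],   # correct mapping
     [0.8, 1, 0.7, 0.90],   # correct mapping
     [0.1, 0, 0.0, 0.40],   # wrong mapping
     [0.2, 0, 0.1, 0.30]]   # wrong mapping
y = [1, 1, 0, 0]
w, b = train(X, y)

# Score a new candidate mapping; near 1.0 means "likely correct".
score = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.85, 1, 0.75, 0.9])) + b)
print(round(score, 2))
```

The practical bottleneck is exactly the one noted above: getting enough labeled mapping examples to train on.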
When we started the NoSQL Now! conference back in 2011, I realized that there was a small but vocal group of architects who believed that graphs were the most flexible way to model many business problems. It is now clear that these ideas not only had merit, but that graphs will become the dominant data structure differentiating organizations in the future. Just like document stores, graph stores have a flexibility that the other four architectures (relational, analytical, key-value and column-family) don’t have. What has also become clear is that you can implement graph stores on column-family stores and get some measure of scalability beyond a single node. If we partition our graphs correctly, we can mitigate the distributed-traversal problems that plagued many early distributed graph systems.
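A sketch of that idea: store each vertex’s out-edges as columns in a row keyed by the vertex, and partition rows by a stable hash of the vertex so a traversal step touches a single partition. The layout and names are illustrative only, not any particular product’s design:

```python
import zlib
from collections import defaultdict

# Sketch: a graph stored in a column-family layout. Each partition maps
# row key (source vertex) -> columns (destination vertex -> edge label).
NUM_PARTITIONS = 4
partitions = [defaultdict(dict) for _ in range(NUM_PARTITIONS)]

def part_of(vertex):
    # A stable hash keeps all of a vertex's out-edges in one partition,
    # so a single traversal step never crosses partition boundaries.
    return zlib.crc32(vertex.encode()) % NUM_PARTITIONS

def add_edge(src, dst, label):
    partitions[part_of(src)][src][dst] = label

def neighbors(src):
    return partitions[part_of(src)][src]

add_edge("alice", "acme", "works_for")
add_edge("alice", "bob", "knows")
add_edge("bob", "acme", "works_for")

print(sorted(neighbors("alice")))  # ['acme', 'bob']
```

Multi-hop traversals still hop between partitions, which is why partitioning strategy (e.g., co-locating densely connected subgraphs) matters so much for distributed graph performance.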
The movement to EKGs has accelerated in recent months due to two new developments. One is the addition of the Neptune graph database to Amazon’s database portfolio. The second is the funding of both cloud and on-premises graph systems like TigerGraph and other Bay Area startups. Many of the graph architects from Google, LinkedIn and Facebook (all heavy users of graph databases) are now venturing out on their own to develop solutions for the enterprise.
EKGs still present an interesting set of problems that must be solved before they can truly be considered enterprise-class. Transactions, bitemporal versioning, role-based access control, auditing, scalability and high availability are all capabilities that traditional open-source proof-of-concept systems struggle with. Academics tend not to be interested in these real-world problems. Let’s hope that Neptune and the new venture funding pools address these challenges.
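Bitemporal versioning is perhaps the least familiar item on that list. A minimal sketch, assuming every fact carries both a valid-time interval (when it was true in the real world) and a transaction-time interval (when the database believed it); the field names and example are hypothetical:

```python
from datetime import date

# Sketch of bitemporal versioning: beliefs are never overwritten, only
# superseded, so we can ask "what did we believe on day X was true on
# day Y?". Rows: (value, valid_from, valid_to, tx_from, tx_to).
END = date.max
facts = []

def record(value, valid_from, today):
    """Record a new belief, logically closing (not deleting) old ones."""
    for i, (v, vf, vt, tf, tt) in enumerate(list(facts)):
        if tt == END:
            facts[i] = (v, vf, vt, tf, today)           # close old belief
            if vf < valid_from:                          # old value still held
                facts.append((v, vf, valid_from, today, END))  # before the change
    facts.append((value, valid_from, END, today, END))

def as_of(valid_day, tx_day):
    """What did we believe on tx_day was true on valid_day?"""
    for (v, vf, vt, tf, tt) in facts:
        if vf <= valid_day < vt and tf <= tx_day < tt:
            return v
    return None

# A customer address: recorded in January, corrected in June.
record("123 Elm St", date(2017, 1, 1), today=date(2017, 1, 5))
record("9 Oak Ave",  date(2017, 6, 1), today=date(2017, 6, 10))

print(as_of(date(2017, 3, 1), date(2017, 7, 1)))  # 123 Elm St
print(as_of(date(2017, 8, 1), date(2017, 2, 1)))  # 123 Elm St (move not yet known)
print(as_of(date(2017, 8, 1), date(2017, 7, 1)))  # 9 Oak Ave
```

Auditors love this property: every answer the system ever gave remains reconstructible, which is exactly what makes it hard to bolt onto a system designed around in-place updates.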
The key question is: how do all the incredible developments in deep learning play into this? My answer is that deep learning will help us find algorithms that continually enrich our EKGs. We only need to find adequate training data sets to help them do this.
In my new role as a Distinguished Engineer, my hope is that we can start 2018 with a clean slate and take a fresh look at the use cases that drive EKGs. I have been inspired by large-scale healthcare graph projects like UMLS and GO. I hope to make it clear to everyone that deep learning and EKGs will become the core data structures we need to build our enterprise Skynets. My personal hope is that we can use these systems to lower healthcare costs and create a better world for everyone.
Happy New Year Everyone!