Graph Databases, Data Modeling and the Jenga Tower Metaphor
One of the challenges of Big Data is dealing with highly variable data. There are other challenges, which we can recall by the other “V”s (volume, velocity and veracity), but here we are going to talk about data variability and its impact on business agility. We are going to see that with a graph database we can often modify our data model without rewriting our reports. This gives graph databases a significant agility advantage over traditional relational databases.
Many people think of variability as a characteristic of a data domain where only a small percentage of records require specific data elements. For example, you have millions of records that describe a patient's weight, but only 0.1% of the records are from the newborn area, where you need the weight to be very precise and you need to know who took the weight measurement, the date and time, on what scale, and whether the scale was recently calibrated. In a relational database this can be difficult to model, since all this additional information might need to go in additional columns in the patient’s table, and those columns would sit empty for the other 99.9% of records.
However, variability also occurs over time. In some business problems you can be confronted with new data that needs to be added to your database every week. In some cases these changes are just new columns that need to be added to a table, and they have a low impact on the library of reports that run against your database. Often, however, these changes require your database to be redesigned. This is where all hell breaks loose, because a large percentage of your reports then need to be updated as well. This is the Jenga Tower collapsing. To add a single feature, your teams need to spend hundreds of hours changing the database and the reports, and all of those changes need to be deployed in sync without causing downtime for your users. This is a very tricky problem in relational database systems! Constantly rebuilding the Jenga Tower takes time and money.
This is one place where graph databases tend to shine. Like many other NoSQL databases, graph databases are often described as “schema-less”, “schema-free” or “schema-agnostic”. This means that you don’t need to have a formal data model designed before you add some new data. This is also sometimes called the “Open World Assumption” (OWA), where anyone can add any node to a graph without impacting a centralized data model.
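As a minimal sketch of what this looks like in practice, the toy in-memory property graph below stores the newborn weight example from earlier. The PropertyGraph class, the node IDs, the edge names and all of the sample values are made up for illustration; this is not the API of any particular graph database. The point is only that the extra newborn details attach to the nodes that need them, and the function that reads weights does not change.

```python
class PropertyGraph:
    """Toy property graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = {}  # node_id -> {"label": str, "props": dict}
        self.edges = []  # (source_id, edge_type, target_id)

    def add_node(self, node_id, label, **props):
        # No schema check: each node carries whatever properties it needs.
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, source, edge_type, target):
        self.edges.append((source, edge_type, target))

    def neighbors(self, source, edge_type):
        return [t for s, e, t in self.edges if s == source and e == edge_type]


def weights_kg(graph, patient_id):
    """The 'report': list a patient's weights. Written before newborn data existed."""
    return [graph.nodes[w]["props"]["kg"]
            for w in graph.neighbors(patient_id, "HAS_WEIGHT")]


graph = PropertyGraph()

# A typical adult record: the weight node only needs a value.
graph.add_node("patient:1", "Patient", name="Ann")
graph.add_node("weight:1", "Weight", kg=70.0)
graph.add_edge("patient:1", "HAS_WEIGHT", "weight:1")

# A newborn record: extra properties and an extra node type are added
# without touching the adult record or the report function above.
graph.add_node("patient:2", "Patient", name="Baby Lee")
graph.add_node("weight:2", "Weight", kg=3.42,
               taken_by="nurse:7", taken_at="2024-01-05T08:30")
graph.add_node("scale:3", "Scale", last_calibrated="2024-01-02")
graph.add_edge("patient:2", "HAS_WEIGHT", "weight:2")
graph.add_edge("weight:2", "MEASURED_ON", "scale:3")

print(weights_kg(graph, "patient:1"))
print(weights_kg(graph, "patient:2"))
```

In a relational design the same change would typically mean new nullable columns or new tables; here the adult weight node simply never acquires the newborn-only properties.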
Now the question is: “What about all my graph reports? Will they still work?” In graph databases the answer is often “Yes”. Let me explain why.
Relational databases are ironic because relationships are often not precisely defined when you create the model. Relationships between tables are created when a query is executed. Relationships are part of the report logic, not just the model. The programmer also needs to know which tables are designed to be joined together and which columns are used in these joins. As a consequence, the relationship logic needs to be embedded in every report, and changes to the model often break these reports.
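To make this concrete, here is a small sketch using Python's built-in sqlite3 module. The table names, column names and the refactoring itself are hypothetical. The report's SQL names both tables and the join columns, so a redesign that moves weights into a more general table breaks the report even though no data is lost.

```python
import sqlite3

# Hypothetical report: the relationship between patient and weight
# is spelled out inside the query text itself.
REPORT_SQL = """
    SELECT p.name, w.kg
    FROM patient AS p
    JOIN weight AS w ON w.patient_id = p.id
"""


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE weight (patient_id INTEGER, kg REAL);
        INSERT INTO patient VALUES (1, 'Ann');
        INSERT INTO weight VALUES (1, 70.0);
    """)
    return conn


def refactor(conn):
    # Redesign: fold weights into a generic measurement table.
    conn.executescript("""
        CREATE TABLE measurement (patient_id INTEGER, kind TEXT, value REAL);
        INSERT INTO measurement
            SELECT patient_id, 'weight_kg', kg FROM weight;
        DROP TABLE weight;
    """)


def report_runs(conn):
    """Return True if the report still executes against the current schema."""
    try:
        conn.execute(REPORT_SQL).fetchall()
        return True
    except sqlite3.OperationalError:
        return False


conn = make_db()
before = report_runs(conn)  # schema matches what the report expects
refactor(conn)
after = report_runs(conn)   # the joined table no longer exists
print(before, after)
```

Every report that embeds this join has to be found and rewritten in step with the schema change, which is the Jenga Tower problem in miniature.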
Graph databases, on the other hand, take an entirely different approach. Queries can often be written in a way that does not depend on any specific relationships. One example of this is a graph similarity calculation. Similarity calculations are frequently used in recommendation systems. We ask the order-entry system what “similar” purchases other customers made.
A more complex version of this query also exists in healthcare when searching for the best treatment options. For example, let’s assume we have a local hospital chain with a million patients, and each patient has 10,000 data elements in their electronic medical record (EMR). Now for a given patient we want a query that will generate a list of the most similar patients and their care plans. This is difficult to do in relational databases, but it can be a simple query in a graph database and it often runs within seconds. Adding new clinical data to the EMR does not require us to rewrite this query.
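One way to sketch such a relationship-agnostic query is with Jaccard similarity over each patient's set of outgoing edges. This is an assumed, simplified measure chosen for illustration, not necessarily what a production graph database uses, and the edge names and patient IDs below are invented. Notice that the query code never mentions a specific edge type, so a new kind of clinical data changes the scores but not the code.

```python
from collections import defaultdict


def build_index(edges):
    """Map each source node to the set of (edge_type, target) pairs it has."""
    features = defaultdict(set)
    for source, edge_type, target in edges:
        features[source].add((edge_type, target))
    return features


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(features, patient, candidates, top_k=3):
    """Rank candidates by similarity to patient, using whatever edges exist."""
    mine = features.get(patient, set())
    scores = [(jaccard(mine, features.get(c, set())), c)
              for c in candidates if c != patient]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return scores[:top_k]


patients = ["p1", "p2", "p3"]
edges = [
    ("p1", "HAS_CONDITION", "diabetes"),
    ("p1", "TAKES", "metformin"),
    ("p2", "HAS_CONDITION", "diabetes"),
    ("p2", "TAKES", "metformin"),
    ("p3", "HAS_CONDITION", "asthma"),
]
before = most_similar(build_index(edges), "p1", patients)

# New clinical data arrives with an edge type the query has never seen.
edges += [
    ("p1", "HAS_GENE_VARIANT", "variant_x"),
    ("p3", "HAS_GENE_VARIANT", "variant_x"),
]
after = most_similar(build_index(edges), "p1", patients)

print(before)
print(after)
```

A real system with a million patients would rely on the database's own similarity algorithms and indexes rather than a Python loop; the sketch only shows why the query text can stay the same as the data grows.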
To be fair, there are still some types of data model changes in graph databases that are very disruptive to your queries. These occur when there are major refactorings of the relationships between key entities in your graph. My experience is that these highly disruptive changes get fewer and farther between as your graph model reaches maturity.
In summary, the Jenga tower metaphor is useful when we are trying to understand when a graph database might be a better fit than a relational database. The Jenga tower story applies when we have high-variability data (variable in content and over time) and hundreds of queries that need to be maintained.
The Jenga tower metaphor is one of the top ten metaphors we use to help non-technical decision makers understand when to use a graph database. Where we have highly variable data, we don’t want to be constantly rewriting our queries. We want to keep adding data, and the more data we have, the better our analysis. We need to support highly variable datasets, and we want to maintain our enterprise reporting agility.