When selecting a database, one of the questions a solution architect must answer is “Which database best fits my problem?” There has been much written on the process of matching requirements to a solution. Most of our book, Making Sense of NoSQL, is focused on this problem. The answer is complex and involves people experienced with many database architectures helping to guide the process. This article will focus on one specific question: which databases best support rapidly changing requirements?
The frame for this discussion comes from the theory of evolution. How do different animals pick different strategies for survival? Do they become specialized, or do they remain generalists? Should they focus on traits that allow them to adapt quickly to a changing environment, or should they focus on one specific strategy, like eating one specific plant?
Two examples of these divergent strategies are the raccoon and the giant panda. Raccoons are generalists: omnivores that eat many things in their environment. Giant pandas, on the other hand, are highly specialized feeders. Their diet consists mostly of bamboo, and they have a digestive tract that has evolved to break down bamboo in ways other animals can’t. They survived because of their specialization. They found their niche.
So now we ask, “Which animal can adapt faster to a changing environment?” Since human populations have cut down many of the bamboo forests that pandas can inhabit, we know that giant pandas have unfortunately picked a strategy that has put them on the endangered species list.
Raccoon populations, on the other hand, are larger today than at any time in history. As the human population has expanded, we have changed the landscapes around our cities and farms, and raccoon populations have exploded. Why? Because raccoons are one of the few animals that can quickly adapt to changing environments.
So how does this apply to databases? Let’s take a look at the popularity trends on the site DB-Engines.
We see that graph databases have clearly topped the list of fastest-growing databases. So why is this? I have thought about it long and hard. I have several theories, but the one that keeps coming up is the ability of graph databases to quickly adapt to changing requirements. With graph databases we can add new nodes and relationships to a graph at any time in the application’s lifecycle, and we can usually do this without disrupting the existing data services. Graphs are the raccoons of the database world.
The technical term for this is “schema agnostic”. These databases don’t care about the structure of your data; their schemas are always flexible. Even if only 0.1% of your data needs some new elements, they can adapt. They are not thrown off by exceptions. They grow gracefully over time.
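As a minimal sketch of this flexibility, here is a toy property graph built from plain Python dictionaries, standing in for a real graph database; every node, edge, and property name below is invented for illustration. The point is that a new relationship type, or a new property on a handful of nodes, can be added at any time without migrating existing data:

```python
# A toy property graph: nodes and edges are plain dicts, so no
# schema migration is needed when requirements change. This is a
# stand-in for a real graph database; all names are illustrative.
nodes = {
    "p1": {"label": "Person", "name": "Ada"},
    "c1": {"label": "Company", "name": "Acme"},
}
edges = [
    {"from": "p1", "to": "c1", "type": "WORKS_FOR"},
]

# Months later, a new requirement arrives: track mentorship.
# We simply add a node and a brand-new edge type; existing data
# and existing queries are untouched.
nodes["p2"] = {"label": "Person", "name": "Grace"}
edges.append({"from": "p2", "to": "p1", "type": "MENTORS"})

# Only the tiny fraction of nodes that need a new attribute get one.
nodes["p2"]["passport_numbers"] = ["X123", "Y456"]

# Queries over the new relationship type work immediately.
mentors = [e for e in edges if e["type"] == "MENTORS"]
```

A real graph database adds indexing, transactions, and a query language on top, but the shape of the change is the same: new structure is appended, not migrated.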
This stands in sharp contrast to any database that needs to work off a pre-defined structure, such as a relational database or an analytical database built around a star schema. With these databases you need to know all the future requirements of your system up front. Organizations typically spend six months to a year gathering detailed requirements before they create their data model.
Relational databases also tend to slow down every time there is an exception to a one-to-one mapping rule. What was once a simple column managing a one-to-one relationship (like a person to a passport number) becomes a join to another table when you find a rare many-to-many attribute. Yes, in the real world people do have multiple passports. Even if only 0.1% of your data has multiple values, every query will need an extra join, which slows down your queries.
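A hedged sketch of that migration, using Python’s built-in sqlite3 module (the table and column names are invented for illustration): once one person can hold multiple passports, the passport column must move into its own table, and every passport lookup pays for a join, even for the people who hold exactly one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original one-to-one design would put passport_number directly on
# the person table. Once multiple passports are possible, it must
# become a separate table.
cur.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE passport (
    person_id INTEGER REFERENCES person(id),
    number TEXT)""")

cur.execute("INSERT INTO person VALUES (1, 'Ada'), (2, 'Grace')")
cur.execute(
    "INSERT INTO passport VALUES (1, 'A111'), (2, 'B222'), (2, 'C333')"
)

# The join now applies to every row, even though only a tiny
# fraction of people actually hold more than one passport.
rows = cur.execute("""
    SELECT person.name, passport.number
    FROM person JOIN passport ON passport.person_id = person.id
    ORDER BY passport.number""").fetchall()
```

In a schema-agnostic store, the same change would be a multi-valued property on the affected records only, with no impact on everyone else’s queries.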
Relational databases are the pandas of the database world. They are specialized for data that is simple and has a fixed format with low variability. That does not make them bad; it just means they are inappropriate for some use cases.
I think many people in the graph community have an intuitive feeling that their designs will be much more robust in the face of continually changing requirements. They also approach new projects with an agile process: let’s get some sample data loaded and continually enrich the graph as we learn. They can also use the Load-as-is pattern and harmonize data as needed. They have a lot less fear of disruptive change.
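One way to picture the Load-as-is pattern is the following minimal sketch; the record shapes and the harmonization function are assumptions for illustration, not a prescribed API. Raw records are ingested exactly as they arrive, and reconciliation into a unified view happens later, only where a consumer needs it:

```python
# Load-as-is: ingest heterogeneous source records without forcing
# them into one schema up front. Field names here are invented.
raw_records = [
    {"name": "Ada Lovelace", "born": "1815"},
    {"full_name": "Grace Hopper", "birth_year": 1906},  # different shape
]

def harmonize(record):
    """Map the varying source fields onto one unified view, on demand."""
    return {
        "name": record.get("name") or record.get("full_name"),
        "birth_year": int(record.get("born") or record.get("birth_year")),
    }

# Harmonization is deferred until a consumer asks for the unified view;
# the raw records stay available, untouched, for future needs.
people = [harmonize(r) for r in raw_records]
```

The appeal is that new source shapes can be loaded immediately, and only the harmonization step, not the stored data, has to evolve as requirements change.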
Graph databases are not the only ones that are schema agnostic. The other NoSQL database types (key-value, column-family, and document databases) are schema agnostic as well, so that trait alone is not what is driving graph popularity. It also has to do with the emergence of graph query languages (SPARQL, Cypher, and Gremlin) and the evolution of databases that offer both scalability and support for ACID transactions.
So how does this impact your selection of a database? The bottom line is that if we are working with requirements that might change, we should weigh schema-agnostic solutions more heavily than other options.