Data Lakes: sometimes we love them, sometimes we hate them. So how do we love them more and hate them less? One way is to use a graph of concepts and rules to make sense out of the rows of “1”s and “0”s in our Data Lakes. Let me explain.
This all began with a simple attempt to save some money. Relational database chargeback rates (what the IT department charges a business unit for database access) were running over $100K/TB/year. So we got the bright idea to store our data in a distributed file system, something like the Hadoop Distributed File System (HDFS) or Amazon’s Simple Storage Service (S3), which currently runs about $276/TB/year (2.3 cents/GB/month). And initially this was a good thing. We finally got away from the dogma of a single node and we moved into the wonderful world of distributed data. We got the benefits of high availability and scalability but without the high storage costs.
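For readers who want to check the arithmetic, here is a small sketch of the storage cost comparison. It uses only the two figures quoted above and assumes decimal units (1 TB = 1000 GB); the constant and function names are made up for illustration.

```python
# Figures quoted in the text; decimal units (1 TB = 1000 GB) are an assumption.
S3_CENTS_PER_GB_MONTH = 2.3
RDBMS_DOLLARS_PER_TB_YEAR = 100_000


def dollars_per_tb_year(cents_per_gb_month: float) -> float:
    """Convert a cents/GB/month price into dollars/TB/year."""
    dollars_per_gb_month = cents_per_gb_month / 100
    return dollars_per_gb_month * 1000 * 12


s3_cost = dollars_per_tb_year(S3_CENTS_PER_GB_MONTH)
ratio = RDBMS_DOLLARS_PER_TB_YEAR / s3_cost

print(f"S3: about ${s3_cost:.0f}/TB/year")
print(f"Relational chargeback is roughly {ratio:.0f}x more expensive")
```

This only compares raw storage prices. It says nothing about query, compute or staffing costs, which the chargeback rate may well include.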
But we had a problem. Our relational databases tend to store data as numeric codes, and those codes landed in our data lakes unchanged. Sometimes only the application that created the data could find the true meaning of these codes. The meaning often depended on the “code table” that converted the selection list on a form into a numeric code. And these forms might change every month. As a result our Data Lake became a Data Swamp. We had the data but no one could decode it.
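To make the problem concrete, here is a minimal sketch of why a bare numeric code is not enough. The code tables, dates and labels below are invented for illustration. The point is only that the same code can mean different things depending on which version of the form wrote the record.

```python
from datetime import date

# Hypothetical code tables, keyed by the date each version of the form went live.
CODE_TABLES = {
    date(2018, 1, 1): {1: "phone", 2: "e-mail"},
    date(2018, 2, 1): {1: "chat", 2: "phone", 3: "e-mail"},
}


def decode(code: int, recorded_on: date) -> str:
    """Decode a numeric code using the table in effect when the row was written."""
    effective = [start for start in CODE_TABLES if start <= recorded_on]
    if not effective:
        raise LookupError(f"no code table in effect on {recorded_on}")
    table = CODE_TABLES[max(effective)]
    return table.get(code, f"unknown code {code}")


# The same code, one month apart, decodes to two different meanings.
print(decode(1, date(2018, 1, 15)))
print(decode(1, date(2018, 2, 15)))
```

If the code tables never make it into the lake alongside the rows, this lookup cannot be written at all, and that is the swamp.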
In the last few years we have seen an entirely new generation of tools that help us decode these values and turn them into useful information. We can now see the data not just as binary data but as customers, products, purchases, calls, chats and e-mails. These tools have gone by many names:
- Semantic Data Lakes
- Data Hubs
- Semantic Data Marts
- Enterprise Knowledge Graphs (my personal favorite)
What many of these systems have in common is that they are putting two key technologies to work to decode this binary data: semantics and graph databases. The result is what some people are now calling the Ontology-Based Data Access (OBDA) pattern[*]. It is more than a simple pattern. It is a knowledge architecture pattern. And more important than what you call it is what these access patterns enable. They make a large amount of data easier to access for our developers. It is all about convenience for the developer.
However, there is one caveat. Our developers need to stop thinking about data in terms of comma-separated value (CSV) files full of obscure numeric codes that they spend days or weeks decoding into useful knowledge. They need to begin to think of their data in terms of graphs (and documents) of connected entities and let the ingestion ontologies do the heavy lifting. Getting developers to re-think their world is not an easy task and many organizations have a lengthy re-training process in front of them.
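As a rough illustration of that shift, here is a toy sketch that turns one coded CSV row into a handful of graph edges. The column names, status codes and identifiers are all hypothetical, and the hard-coded mapping only stands in for what an ingestion ontology would supply. It is not how any particular OBDA product works.

```python
import csv
import io

# A made-up extract: one row of obscure codes, as it might sit in the lake.
RAW = "cust_id,status_cd,prod_id\n42,2,7\n"

# Hypothetical decoding rule that an ingestion ontology would provide.
STATUS_CLASSES = {"1": "Prospect", "2": "ActiveCustomer"}


def row_to_triples(row: dict) -> list:
    """Map one CSV row onto (subject, predicate, object) edges."""
    subject = f"customer/{row['cust_id']}"
    status = STATUS_CLASSES.get(row["status_cd"], "UnknownStatus")
    return [
        (subject, "type", status),
        (subject, "purchased", f"product/{row['prod_id']}"),
    ]


triples = [
    triple
    for row in csv.DictReader(io.StringIO(RAW))
    for triple in row_to_triples(row)
]

for triple in triples:
    print(triple)
```

Once the row is expressed this way, a developer asks for "active customers and what they purchased" instead of remembering that status_cd 2 means active.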
The OBDA data access pattern is non-trivial and difficult to explain to those not familiar with AI concepts. We could spend hours explaining how terminology graphs (aka the TBox) and assertion graphs (aka the ABox) differ and how they support the process. Note that when I say AI concepts, I am not referring only to the narrow field of deep learning. I am referring to a broader field that includes what many now call Good Ol’ Fashioned AI (GOFAI).
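For a quick feel for the TBox and ABox distinction, here is a minimal sketch in plain Python. The class and individual names are invented, and the only "reasoning" shown is following subclass links. Real ontology languages and reasoners do far more than this.

```python
# TBox: statements about concepts (the terminology). Names are hypothetical.
TBOX = {
    ("PremiumCustomer", "subClassOf", "Customer"),
    ("Customer", "subClassOf", "Party"),
}

# ABox: statements about individual things (the assertions).
ABOX = {
    ("cust-42", "type", "PremiumCustomer"),
    ("cust-42", "purchased", "prod-7"),
}


def superclasses(cls: str) -> set:
    """Every class reachable from cls by following subClassOf edges in the TBox."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in TBOX:
            if subj == current and pred == "subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def types_of(individual: str) -> set:
    """Asserted types from the ABox plus types implied by the TBox."""
    asserted = {o for s, p, o in ABOX if s == individual and p == "type"}
    inferred = set()
    for cls in asserted:
        inferred |= superclasses(cls)
    return asserted | inferred


print(sorted(types_of("cust-42")))
```

The ABox never says cust-42 is a Customer. That fact falls out of combining the two graphs, which is the part of the pattern that does the heavy lifting.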
GOFAI includes many areas, but knowledge representation is something that has been underappreciated by those enamored with their TensorFlow programs and their GPUs. Cool stuff, I agree. And we are making great progress doing things that we have not done in the past. Yet without strong knowledge representation, where do we store these insights? How can we possibly transfer insights between domains without a robust knowledge representation?
To be honest, although the OBDA data access pattern is a good summary of many of the things I have done in the past, it has several sub-components that are still new to me, and I am struggling to understand them well enough to explain them to my peers. I look forward to any comments from readers that help us all understand these ideas.
[*] For a detailed discussion of OBDA see the book Exploiting Linked Data and Knowledge Graphs in Large Organizations (2017 Springer). The book has many problems but it still has a good summary of OBDA and how it is being used by organizations.