This week eBay announced that they are making their internal distributed knowledge graph available under an Apache 2.0 open source license. They call this product “Beam”.
Here are a few interesting items about Beam.
Just like DGraph, Beam is written in Go. Go is a compiled language that is very popular for compute-intensive server applications. Go programs are often considerably faster than comparable Python or Java programs, and Go has a much more modern syntax than C or C++. Go continues to gain in popularity for distributed database development.
Like early versions of DGraph (which has since moved to its own Badger store), Beam uses RocksDB, an open source key-value store, as its underlying database. RocksDB is a pretty low-level system, but because the graph is built on top of a general-purpose key-value store, we would not qualify Beam as a native graph database. It is technically an overlay or layered graph. Beam probably stores both vertices and edges as keys, with the values being a list of connecting items. To traverse a graph you would need to look up each key in turn to get the connections stored in its value.
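To make the layered idea concrete, here is a minimal Go sketch of a graph stored on top of a key-value store. It illustrates my guess above, not Beam's actual storage layout. The `KV` map is only a stand-in for RocksDB (it is not the RocksDB API), and the `out/` key prefix, the comma-separated value encoding, and all function names are assumptions made up for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// KV is a stand-in for a key-value store such as RocksDB.
// A plain map keeps the sketch self-contained; it is not the RocksDB API.
type KV map[string]string

// edgeKey builds a hypothetical key under which a vertex's outgoing
// edges are stored.
func edgeKey(vertex string) string { return "out/" + vertex }

// AddEdge appends dst to the comma-separated neighbor list stored under src.
func (kv KV) AddEdge(src, dst string) {
	k := edgeKey(src)
	if kv[k] == "" {
		kv[k] = dst
		return
	}
	kv[k] += "," + dst
}

// Neighbors performs one key lookup and decodes the neighbor list.
func (kv KV) Neighbors(vertex string) []string {
	v, ok := kv[edgeKey(vertex)]
	if !ok || v == "" {
		return nil
	}
	return strings.Split(v, ",")
}

// Reachable walks the graph breadth-first from start. It returns the
// visited vertices in sorted order and the number of key lookups issued,
// which is one per visited vertex.
func (kv KV) Reachable(start string) (visited []string, lookups int) {
	seen := map[string]bool{start: true}
	queue := []string{start}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		lookups++
		for _, n := range kv.Neighbors(cur) {
			if !seen[n] {
				seen[n] = true
				queue = append(queue, n)
			}
		}
	}
	for v := range seen {
		visited = append(visited, v)
	}
	sort.Strings(visited)
	return visited, lookups
}

func main() {
	kv := KV{}
	kv.AddEdge("camera", "electronics")
	kv.AddEdge("camera", "lens")
	kv.AddEdge("lens", "optics")

	nodes, lookups := kv.Reachable("camera")
	fmt.Println(nodes, lookups)
}
```

The point of the sketch is the cost model: every hop in a traversal turns into another key lookup, which is the main difference from a native graph database that follows direct pointers between vertices.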
The authors acknowledge that the sweet spot for the current implementation of Beam is large knowledge graphs that need fast parallel reads but infrequent updates. They note that the architecture is intentionally clean and simple to support these use cases.
Here are the primary use cases eBay cites in the GitHub README file:
- Powers real-time interfaces
- Complements machine learning applications
- Makes sense of new, unstructured information in the context of the existing knowledge
From this description, my guess is that Beam is used internally at eBay to process new or updated product descriptions using NLP and store the resulting tags in a graph that links to the product ontologies. These classifications could then be used to accelerate product search in real time. This is a good example of how machine learning and knowledge graphs can work together. Since product descriptions are usually written once and then searched frequently, this fits their architecture well.
Here is what the README file says about its use within eBay:
“Beam isn’t ready for production-critical deployments, but it’s useful today for some use cases. We’ve run a 20-server deployment of Beam for development purposes and off-line use cases for about a year, which we’ve most commonly loaded with a dataset of about 2.5 billion facts.”
What is clear is that Beam has been tested at scale. It goes way beyond what can be done on a single node. That puts it in a small, elite group of truly distributed graph products.
The authors are also very frank about the system’s limitations, something I find quite refreshing. They note that because they use a central log to serialize all changes, there are some inherent scale-out limits for writes. So it will not replace Cassandra when you need writes at scale. Clearly, the system was designed to do parallel reads, but not parallel writes. Doing parallel writes is admittedly a much harder problem when we consider that many systems need fast reads immediately following writes. Multi-version concurrency control (MVCC) can help with this, but at the cost of added complexity.
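The trade-off described above can be sketched in a few lines of Go. This is only an illustration of the general pattern (one ordered log, many read replicas), not Beam's implementation. The `Fact`, `Log`, and `View` types and their methods are hypothetical names invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// Fact is a hypothetical subject-predicate-object triple.
type Fact struct{ S, P, O string }

// Log is a single, totally ordered, append-only log. Every write funnels
// through Append, which is why write throughput does not scale out.
type Log struct {
	mu      sync.Mutex
	entries []Fact
}

// Append adds a fact and returns its 1-based log index.
func (l *Log) Append(f Fact) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append(l.entries, f)
	return len(l.entries)
}

// ReadFrom returns a copy of all entries that follow the first `after` entries.
func (l *Log) ReadFrom(after int) []Fact {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]Fact(nil), l.entries[after:]...)
}

// View is a read replica that builds its own index by replaying the log.
type View struct {
	mu      sync.RWMutex
	applied int // number of log entries applied so far
	bySubj  map[string][]Fact
}

// NewView returns an empty view that has applied no log entries.
func NewView() *View { return &View{bySubj: map[string][]Fact{}} }

// CatchUp applies any log entries this view has not seen yet.
func (v *View) CatchUp(l *Log) {
	v.mu.Lock()
	defer v.mu.Unlock()
	for _, f := range l.ReadFrom(v.applied) {
		v.bySubj[f.S] = append(v.bySubj[f.S], f)
		v.applied++
	}
}

// Lookup serves reads; many goroutines can hold the read lock at once.
func (v *View) Lookup(subject string) []Fact {
	v.mu.RLock()
	defer v.mu.RUnlock()
	return append([]Fact(nil), v.bySubj[subject]...)
}

func main() {
	log := &Log{}
	log.Append(Fact{"item1", "category", "cameras"})
	log.Append(Fact{"item1", "brand", "acme"})

	// Each view replays the same log independently, so reads scale with
	// the number of views while writes stay serialized.
	views := []*View{NewView(), NewView(), NewView()}
	var wg sync.WaitGroup
	for i, v := range views {
		wg.Add(1)
		go func(i int, v *View) {
			defer wg.Done()
			v.CatchUp(log)
			fmt.Println("view", i, "facts:", len(v.Lookup("item1")))
		}(i, v)
	}
	wg.Wait()
}
```

Note that a view only sees a write after it has caught up to that position in the log. That gap is exactly the read-after-write problem mentioned above: a reader that needs fresh data has to wait for the view to reach the index returned by `Append`.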
They also noted that, unlike DGraph, they went beyond a simple GraphQL-style query interface. They are working on a query language of their own that is similar to SPARQL, but they don’t claim conformance with the full SPARQL test suite.
The GitHub site credits four primary authors: Simon Fell, Diego Ongaro, Raymond Kroeker, and Sathish Kandasamy.
I want to thank both the authors and eBay for contributing their ideas and code to the open source community.
The source code is on eBay’s GitHub site: https://github.com/eBay/beam