Integrating PKGs into the Enterprise
In my past blogs, I have written about the fast-growing topic of Personal Knowledge Graphs (PKGs) and knowledge management. This blog focuses on the challenges of integrating PKGs into larger knowledge ecosystems, such as your company’s enterprise knowledge graph (EKG) or your college or university’s knowledge management system.
The intended audience for this post is a solution architect who has been asked to understand the trade-offs between many isolated PKG silos and more integrated company knowledge graphs that share knowledge and avoid duplication and inconsistency. We look at the limited integration options available today and forecast large returns on investment as PKG products integrate the latest recommendations of large machine-learning models.
If you are new to the area of PKGs, we strongly recommend reading the prior articles on PKGs, EKGs, and knowledge management to fully understand these concepts.
The Personal vs. Organizational Reuse Tension
Before we dive into terminology, let’s remember that there are always two opposing forces driving PKGs in an organization. The first force is the need for each person to quickly capture notes that are consistent with their personal knowledge base. As you type, autosuggest lists allow you to quickly connect to your own personal concepts in your own PKG. When you type “[[“ in the text editor, a list of existing concepts is presented. As you type, the list is automatically narrowed to match the prefix you have already started.
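As a minimal sketch of how that prefix narrowing works (the concept names here are hypothetical, not from any specific product):

```python
def narrow_suggestions(concepts, prefix):
    """Return the concept names that start with the prefix the user has typed so far."""
    prefix = prefix.lower()
    return sorted(c for c in concepts if c.lower().startswith(prefix))

# Hypothetical concepts already in a PKG:
pkg_concepts = ["Knowledge Graph", "Knowledge Management", "Kanban Board"]

# After the user types "[[Kn", the list narrows to the two matching concepts:
narrow_suggestions(pkg_concepts, "Kn")  # ['Knowledge Graph', 'Knowledge Management']
```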
The second force is the desire of many companies to capture this knowledge in a way that can be used by others in the company. This means that big organizations will often put in additional rules that prevent quick knowledge capture. Do you want to issue a ticket in the helpdesk system? You must tell us about your computer, what you are trying to do, what application you are using, and some information about how to reproduce the error. Without these required fields filled in, the Save button is disabled.
My experience is that we will always see trade-offs. Who pays for the software, how the knowledge is captured, and how knowledge-sharing MBOs are written are often the driving concerns. There is no one-size-fits-all solution here, and the use of large machine-learning Natural Language Processing (NLP) tools is only making the decisions more complex. We will discuss this topic more later in the blog.
Classification of Current Tools
To begin our discussion, consider two desktop tools. The first is a standalone editing tool such as Microsoft Notepad(TM), which comes with your Windows(TM) desktop applications. Think of this as adding a new leaf of knowledge to a small new sapling of a tree. This tool allows you to type in any text from your keyboard and save it to your local file system. Notepad is ideal for capturing new information that is disconnected from the rest of the world. You don’t even need to be connected to your corporate network to use it.
Wikis also allow you to enter freeform text, but always within a page that has a unique name. Your text may contain references to other named pages. Some Wiki pages can contain structured key-value pairs, often called Infoboxes. The key difference between Notepad and Wiki is that Wiki pages are stored on a server and converted to web pages with links.
Now let’s consider a PKG editor such as Roam or Obsidian. These tools are always connected to an existing knowledge base, and the focus is creating quick links to your internal personal graph. Figure 1 illustrates the metaphor we want to understand clearly: PKGs are useful when we also want to capture new information that extends the existing knowledge structure. For example, when you enter a simple concept name, the system checks to verify that it is a new concept and not a duplicate of an existing concept or an alias for one.
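A minimal sketch of that duplicate-and-alias check might look like the following (the concept and alias data are made up for illustration):

```python
def check_new_concept(name, concepts, aliases):
    """Classify a typed name as new, a duplicate, or an alias of an existing concept.

    concepts: set of canonical concept names
    aliases:  dict mapping an alias to its canonical concept name
    """
    key = name.strip().lower()
    canonical = {c.lower(): c for c in concepts}
    alias_map = {a.lower(): c for a, c in aliases.items()}
    if key in canonical:
        return ("duplicate", canonical[key])
    if key in alias_map:
        return ("alias", alias_map[key])
    return ("new", name)

concepts = {"Knowledge Graph"}
aliases = {"KG": "Knowledge Graph"}
check_new_concept("kg", concepts, aliases)  # ('alias', 'Knowledge Graph')
```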
On the lower right corner of the chart are formal ontology editing tools such as Protege and graph databases such as Neo4j or TigerGraph. These graph databases have many constraints, such as formal types for every vertex and edge and strict rules about what types of vertices can connect to each other. Edges also must have a type.
The last box in Figure 3 is the Machine-Learning assisted PKG (ML-PKG), where NLP tools are used to automatically suggest text within your PKG editor. These tools don’t exist yet, but with the fast evolution of large-language models such as BERT and GPT-3, we expect them to be available soon.
In summary, there will be many rules in an integrated Enterprise PKG. Many of these rules can be helpful. They can accelerate your work by adding knowledge using intelligent contextual autocomplete. Some of these rules will get in the way and slow down your knowledge capture rate.
Let’s start our discussion with some basic terms and concepts. In a PKG, the goal is to allow a typed stream of characters to efficiently capture knowledge concepts and make connections to prior concepts. In a PKG, we use a data structure called a knowledge graph. A graph stores not just a flat list of concepts but also the relationships between these concepts. The technical term we use for concept storage is a vertex, and the term for a relationship is an edge.
Untyped vs. Typed Systems
Many of us are familiar with the concept of a wiki. The word “wiki” is a Hawaiian term for “quick.” Wiki page link syntax allows users to minimize the number of keystrokes needed to connect wiki “pages” to each other. In the wiki data model, all concepts are stored on a wiki page, and the design goal is to follow the one-page-per-concept rule so that concepts can be easily linked together. If you put two or more concepts on a page, a link to that page would not clearly show a single-concept-to-single-concept pattern.
Wikis were a wonderful step forward in notetaking because the concept pages, or concept cards, were shared on a web server. As your peers added new concepts, you could create links to those concepts. As you added new concepts, your peers could also link to them. It was a huge breakthrough in shared knowledge management, and it is the basis for systems such as Wikipedia.
However, most people don’t think of wikis as true “knowledge graphs” because they are really just collections of linked untyped documents, just like the World Wide Web. The wiki designers wanted to extend the concept of “hyperlinks” and make links easier to type than the complex syntax of HTML anchor references. The fundamental difference between a wiki and a true knowledge graph is that wiki pages are all of the same “type” (a document/card/concept type), and the relationships also all have a single type (is-related-to).
True knowledge graphs allow each vertex and edge to have a specific type. The list of possible types is often configured by a centralized department. For example, some vertices might represent an employee, some might represent products you sell, and some might describe business events such as a call made to your company helpdesk. The key thing about types is that we can associate rules with types, and these rules force consistency that makes knowledge graphs easy to query, just like we query a traditional database using SQL.
If you are adding a new city to a PKG, you can tell the system the vertex is of type “city” and also add a “located-in” relationship to a “state” vertex. Both vertices and relationships have types.
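To make the city/state example concrete, here is a small sketch of a typed graph with one schema rule. The schema, vertex names, and `add_edge` function are illustrative, not a real product’s API:

```python
from dataclasses import dataclass

# Allowed (source type, edge type, destination type) triples -- the "rules".
SCHEMA = {("city", "located-in", "state")}

@dataclass
class Vertex:
    name: str
    vtype: str

def add_edge(src, etype, dst):
    """Reject edges that violate the schema; return the typed edge otherwise."""
    if (src.vtype, etype, dst.vtype) not in SCHEMA:
        raise ValueError(f"'{etype}' is not allowed from {src.vtype} to {dst.vtype}")
    return (src.name, etype, dst.name)

add_edge(Vertex("Minneapolis", "city"), "located-in", Vertex("Minnesota", "state"))
# ('Minneapolis', 'located-in', 'Minnesota')
```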
In the field of enterprise knowledge graph strategy, we often cite Jim Hendler’s maxim: “A little semantics goes a long way.” Adding a few simple types to both vertices and edges is the principal way to add meaning (semantics) to PKGs and supercharge their queryability and reuse.
Many organizations have found that shared wikis are an ideal system for storing and linking knowledge. However, most organizations don’t query the wikis because they put a priority on free-form editing over strict rules. You can always try to impose structure on a wiki page, but these structures are often optional, and few wiki systems prevent you from saving a wiki page if the required fields are missing. Wikipedia’s Infoboxes are an example of how wikis can be retrofitted to add structured data that can be queried.
Forcing Name Uniqueness
Although most wikis don’t allow you to assign types to new pages at creation time, they do have some other rules that knowledge graphs don’t have. When you create a new page, you need to give it a name. This name serves as the page identifier within a wiki. No two pages can have the same name. In most PKGs, we have a similar rule. If you try to enter a new page with the same name, the software will tell you the page already exists. If you rename a page to the same name as an existing concept, the system will ask you if you want to merge the pages. Adding aliases (one or more labels for the same concept) is another challenge many PKGs don’t handle well.
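A sketch of the rename-and-merge behavior, assuming pages are tracked as a simple name-to-aliases mapping (a deliberate simplification of what real PKG tools store):

```python
def rename_page(pages, old, new):
    """pages: dict mapping a page name to its set of aliases.

    Renaming onto an existing name merges the two pages; the old
    name survives as an alias so no inbound references are lost.
    """
    if new in pages:
        pages[new] |= pages.pop(old) | {old}
        return "merged"
    pages[new] = pages.pop(old)
    return "renamed"

pages = {"Knowledge Graph": {"KG"}, "Knowledge Graphs": set()}
rename_page(pages, "Knowledge Graphs", "Knowledge Graph")  # 'merged'
# pages is now {'Knowledge Graph': {'KG', 'Knowledge Graphs'}}
```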
Applying Typed Systems to Notetaking Event Loops
Now that we have an idea of the critical difference between wikis and knowledge graphs, let’s ask the question, how can we accelerate the ability to make connections as we enter text into a PKG page? We will start with some simple examples and then do a deeper dive into some more complex topics.
The general pattern is an event loop over the characters typed on the keyboard. As the user types each new character, we apply a set of rules to help the user decide whether there is an opportunity to add an edge to an existing concept. If there is, the user presses a key, such as the Tab key, to confirm the relationship and auto-complete the remaining characters for the edge. If they don’t want to accept the suggestion, they can just keep typing.
The key is that when the user is typing, they often want to signal that a link to a specific typed concept is needed. You might already be familiar with this process from social media, where the “@” character references a person or a Twitter account. You might add hashtags using a “#” character at the end of your blog post to let recommender systems tie your post to people interested in a specific concept.
Once you signal to the authoring system that you want to reference an existing node with a type, it confirms the reference as you type or suggests a list of auto-completions.
The key is that organizations have many types of data that you want to include in your edge recommendations. Our focus is finding ways to integrate these organization-specific types into the as-you-type loop in the text editors of our PKG tools. If you want to reference employees in your company, the “@” might pull from your list of company employee names. Clicking on the link for that employee takes the user to the PKG page for that employee.
Another common integration point is linking to a standard company glossary of terms and acronyms. Instead of using a special keyboard symbol, you can add a colon-separated prefix to a vertex name. For example, if you have a company glossary of terms, you can have your users use “g:” as a prefix, and the autocomplete will match those terms. If you are using an industry-specific glossary, you can reference those terminology standards in the prefix. For example, in healthcare, we use the “Just Plain Clear” glossary so that the prefix “jpc:” will auto-fill terms from that system.
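Here is one way to sketch that prefix routing. The glossary contents below are made up, and a real integration would query live terminology services rather than in-memory lists:

```python
# Hypothetical glossary sources keyed by namespace prefix.
SOURCES = {
    "g":   ["Gross Margin", "Grace Period"],   # company glossary
    "jpc": ["Generic Drug", "Geriatrician"],   # "Just Plain Clear" healthcare glossary
}

def prefixed_suggestions(typed):
    """Route a colon-prefixed string like 'g:gro' to the matching glossary."""
    if ":" not in typed:
        return []
    namespace, partial = typed.split(":", 1)
    terms = SOURCES.get(namespace, [])
    return [t for t in terms if t.lower().startswith(partial.lower())]

prefixed_suggestions("g:gro")   # ['Gross Margin']
prefixed_suggestions("jpc:ge")  # ['Generic Drug', 'Geriatrician']
```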
Adding Custom Types
My suggestion when you are creating a PKG integration strategy is to start small and build on your successes based on the value each integration adds to your user community. Start with simple employee lists and company glossary terms that are commonly referenced. Try experiments with things such as “g:” for a company glossary or “w:” for a Wikipedia term and see how frequently they are referenced. If there is low adoption of these suggestions, then they should be removed. Remember, people can always add external links to Wikipedia.
Acceptance Rate Monitoring
There have now been extensive studies about how large-language NLP models such as BERT and GPT-3 can be used to aid software developers by suggesting code. When a system makes a suggested autocompletion, whether the user accepts the suggestion is logged in an event log. The percent of suggestions that a user accepts, called the acceptance rate, is critical to the adoption of these tools.
In general, if your users don’t accept around 1/3 of the autocomplete suggestions, the tools will become more annoying than helpful. Users will disable the tools, and you will not get the feedback you need to be successful. So it is critical to aim for a 30% or higher acceptance rate before you roll these tools out into production.
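The acceptance-rate calculation itself is simple; the hard part is instrumenting the event log. A minimal sketch, using the ~30% bar discussed above as the threshold:

```python
ACCEPTANCE_THRESHOLD = 0.30  # the ~30% bar discussed above

def acceptance_rate(events):
    """events: one dict per suggestion shown, with an 'accepted' boolean."""
    if not events:
        return 0.0
    return sum(e["accepted"] for e in events) / len(events)

log = [{"accepted": True}, {"accepted": False}, {"accepted": True}]
rate = acceptance_rate(log)           # 0.666...
ready = rate >= ACCEPTANCE_THRESHOLD  # True: safe to consider rolling out
```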
Permalinks and PURLs
Adding full URL links from your notes to any internal corporate resources is also something to be wary of. Many internal document management systems, such as Sharepoint(TM), depend on links to specific files within a hierarchical file system. When the permissions to these folders change, or the documents are moved, these links will no longer work.
A better choice is to only make links in your PKGs to locations that your organization has committed to never changing. We call these links Permalinks or PURLs. This can be done by careful management of your internal domain-name system so that a link to http://glossary.mycompany.com/#term will always work, even if the server hosting the glossary changes.
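Conceptually, a PURL service is just a stable-name-to-current-location table that your integration team controls; only the table changes when content moves. The paths and URLs below are illustrative:

```python
# Stable, never-changing paths mapped to wherever the content lives today.
PURL_TABLE = {
    "glossary": "https://wiki.internal.example.com/kb/current-glossary",
}

def resolve(stable_path):
    """Resolve a permalink path; links in notes never need to change."""
    target = PURL_TABLE.get(stable_path)
    if target is None:
        raise KeyError(f"No permalink registered for {stable_path!r}")
    return target

resolve("glossary")  # the glossary's current location
```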
The next step is to ensure that your PKG integration team has tools that can monitor broken links and work proactively to prevent links from being broken as content moves around your organization’s servers. Consider assigning a “value” of one dollar to each working link, and help people understand the knowledge value that is lost as links break and the time it takes to update them.
Most knowledge graphs support bidirectional links. If you change the name of a concept page, all the links to that page will automatically be updated. This is a HUGE win and amounts to super-consistent and low-cost relationship management.
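A sketch of why bidirectional links make renames cheap: because every link is known in both directions, one rename operation can repoint every referrer. The graph representation here is a deliberate simplification:

```python
def rename_concept(graph, old, new):
    """graph: dict mapping page name -> set of outbound link targets.

    With backlinks known, renaming a page lets us rewrite every
    inbound reference in a single pass -- no broken links.
    """
    graph[new] = graph.pop(old)
    for targets in graph.values():
        if old in targets:
            targets.discard(old)
            targets.add(new)

graph = {"Daily Note": {"EKG"}, "EKG": set()}
rename_concept(graph, "EKG", "Enterprise Knowledge Graph")
# graph: {'Daily Note': {'Enterprise Knowledge Graph'}, 'Enterprise Knowledge Graph': set()}
```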
As a general rule of thumb, 60% of the value of a knowledge graph is the nodes, and 40% of the value is the relationships. Broken links discourage authors from making future links, and users are more reluctant to trust systems with many broken links.
Integration with NLP Frameworks
One of the key developments is the rise of low-cost real-time tools that can analyze incoming text and look for keywords and phrases that are relevant to an organization. These processes include automatic classification, named entity extraction, and fact extraction. For example, if you typed in the term “yesterday,” a system could detect the current date and insert a link to yesterday’s date in your timeline so that you could see what notes referenced that date.
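A minimal sketch of the “yesterday” example, assuming the PKG links daily notes with a [[YYYY-MM-DD]] convention (a common pattern, though the exact format varies by tool):

```python
from datetime import date, timedelta

def link_relative_dates(text, today=None):
    """Replace the keyword 'yesterday' with a [[YYYY-MM-DD]] daily-note link."""
    today = today or date.today()
    yesterday = (today - timedelta(days=1)).isoformat()
    return text.replace("yesterday", f"[[{yesterday}]]")

link_relative_dates("Met with the team yesterday", today=date(2022, 3, 2))
# 'Met with the team [[2022-03-01]]'
```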
Where the NLP integration occurs in the editing workflow is also relevant. Real-time inference over large NLP models can be very expensive, and as-you-type checks need to execute quickly; adding keywords to a document at the end of the day is much lower cost. In general, real-time autosuggest services need to respond within about a tenth of a second. At these 100-millisecond response times, organizations that provide service-level agreements might impose onerous charge-back fees that your users don’t want to pay.
Other Integration Points
Integration with the in-character-stream autosuggest and integration with real-time NLP analytics are the two areas that will dominate your PKG integration architectures and drive your PKG product decision points. However, there are a few other areas to consider.
Converting Markdown Extensions
Almost all PKGs today center on extending standard Markdown formats. Unfortunately, different vendors have picked different formats for these extensions. When you import or export Markdown between systems, you may need to add converters between these formats. For example, converting from Obsidian to Roam or vice versa will require you to load converter extensions to do this work. Fortunately, the conversions are pretty simple syntax changes and are well-documented. Many small Python programs are already available to do them. Things to check for include external links, image links, metadata tags, text highlighting, and aliases.
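As an illustration of how simple these conversions usually are, here is a sketch of two common swaps. Syntax details vary between tools and versions, so treat these rules as examples rather than a complete converter:

```python
import re

def obsidian_to_roam(text):
    """Convert two common Obsidian Markdown extensions to their Roam equivalents."""
    # Obsidian highlights text with ==...==; Roam uses ^^...^^.
    text = re.sub(r"==(.+?)==", r"^^\1^^", text)
    # Obsidian embeds images with ![[file]]; Roam expects standard Markdown images.
    text = re.sub(r"!\[\[(.+?)\]\]", r"![](\1)", text)
    return text

obsidian_to_roam("See ![[chart.png]] and ==this highlight==")
# 'See ![](chart.png) and ^^this highlight^^'
```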
Conclusion
There will always be a natural tension between quick free-form notetaking unencumbered by rules and the desire for the organization to reuse knowledge and make it consistent. PKGs, wikis, and enterprise knowledge graphs are all evolving in conjunction with the explosion of NLP tools and the growth of large-language models that suggest text within the context of a text editor.