I believe that within five years there will be dramatic growth in a new field called Knowledge Science. Knowledge scientists will be ten times more productive than today’s data scientists because they will be able to make a new set of assumptions about the inputs to their models and they will be able to quickly store their insights in a knowledge graph for others to use. Knowledge scientists will be able to assume their input features:
- have higher quality
- are harmonized for consistency
- are normalized to be within well-defined ranges
- remain highly connected to other relevant data, such as provenance and lineage metadata
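To make these assumptions concrete, here is a minimal sketch of what it could look like for a knowledge scientist to pull a feature whose values and lineage travel together. Everything here is hypothetical (the `Feature` class, the `publish`/`lookup` functions, and the feature name are illustrative, not a real product's API); the point is that provenance stays attached to the values instead of being stripped away.

```python
# Hypothetical sketch: a feature stored in a tiny in-memory "knowledge graph"
# so that normalized values and lineage metadata are retrieved together.
from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str
    values: list                                    # harmonized, normalized to [0, 1]
    provenance: dict = field(default_factory=dict)  # edges to lineage metadata

graph = {}  # stand-in for a real graph store

def publish(feature: Feature) -> None:
    """Store a feature so its lineage stays connected to its values."""
    graph[feature.name] = feature

def lookup(name: str) -> Feature:
    """Retrieve values and provenance in one connected lookup."""
    return graph[name]

publish(Feature(
    name="customer_tenure_norm",
    values=[0.1, 0.4, 0.9],
    provenance={"source": "crm.accounts", "pipeline": "tenure_v2"},
))

feature = lookup("customer_tenure_norm")
assert all(0.0 <= v <= 1.0 for v in feature.values)  # within a well-defined range
assert "source" in feature.provenance                # lineage still attached
```

In a real system the dictionary would be replaced by a graph database, but the contract is the same: the consumer never has to rediscover where a feature came from.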
Anyone with the title “Data Scientist” can tell you that the job is often not as glamorous as you might think. An article in the New York Times pointed out:
> Data scientists…spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
The process of collecting and preparing data for analysis is frequently called “data janitorial” work, since it amounts to cleaning up data. In many companies, the problem could be mitigated if data scientists shared their data cleanup code. Unfortunately, different teams use different tools for analysis: some use Microsoft Excel or Excel plugins, others SAS, R, Python, or any number of other statistical software packages. Within the Python community alone, there are thousands of libraries for statistical analysis and machine learning. Deep learning itself, although mostly done in Python, has dozens of libraries to choose from. With all these options, the chances of two groups sharing code or the results of their data cleanup are pretty small.
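The kind of cleanup being duplicated across teams is usually mundane and mechanical. Here is a hedged sketch (the function and field names are hypothetical) of a typical janitorial routine that, if published once as shared code, would not need to be rewritten by every group:

```python
# Hypothetical example of "data janitorial" code that teams often
# rewrite independently: trimming whitespace, unifying the many spellings
# of "missing", and standardizing field names.

def clean_record(raw: dict) -> dict:
    """Normalize one raw record's keys and missing-value markers."""
    missing_markers = {"", "na", "n/a", "null", "none"}
    cleaned = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = value.strip()
            if value.lower() in missing_markers:
                value = None
        cleaned[key.strip().lower()] = value
    return cleaned

print(clean_record({" Age ": " 42 ", "City": "N/A"}))
# → {'age': '42', 'city': None}
```

A shared, tested routine like this is exactly the reusable artifact that a feature store or knowledge graph could distribute, instead of each team maintaining its own slightly different copy.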
There have been some notable efforts to help data scientists find and reuse data artifacts. The emergence of products in the Feature Store space is a good example of building reusable artifacts to make data scientists more productive. Both Google and Uber have discussed their efforts to build tools that reuse features and standardize the feature engineering process. My big concern is that many of these efforts focus on building flat files of disconnected data. Once the features have been generated, they can easily become disconnected from reality. They quickly start to lose their relationships to…