Historical Data: The Gordian Knot of Machine Learning

By Chris Lee
Jun 4, 2020

Key Takeaways:

Historical ground truth data, which provides the context needed for machine learning systems to function, is hard to get and difficult to interpret.
Traditional approaches to dealing with this data gap amount to variations of trying to untie a Gordian knot: all lead to a lack of demonstrated value.
Time series AI provides a path to cutting the knot: getting better data from the richness of current context, instead of from the paucity of historical records.

The promise of machine learning is simple: Take some operational data, add some ground truth data (labels), put it into a black box and receive wisdom. The reality of machine learning is more complicated.

Data, especially the ground truth data used to label interesting events, is hard to get and can be difficult to interpret. This is enough of a challenge that the Pentagon’s Joint Artificial Intelligence Center calls it out explicitly in its RFI for predictive maintenance, where it accounts for half of the challenge areas presented. We have also encountered this challenge first hand on many occasions. For example, in one case we had two years of maintenance records but could not make effective use of them:

Data entry issues made it difficult to interpret what had actually happened.
E.g. blank fields in the log data.
Lack of data structuring made categorization of events a challenge.
E.g. use of free-form text to capture the entire event instead of using a label per field.
Inconsistent or non-existent conventions for event naming made it difficult to provide accurate, useful event labels.
E.g. calling the same type of service action different things at different times. “Part replacement” and “maintenance” refer to the same type of service.
Single events being split into multiple parts made event labels unreliable.
E.g. a system calibration may be misdiagnosed, resulting in three entries instead of one. It is first recorded as “maintenance.” When that does not resolve the problem, it is addressed again, this time as a “part replacement,” and finally the correct action is taken and recorded as a “system calibration.” Note that two of these entries label the event incorrectly.
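
To make these four issues concrete, here is a minimal Python sketch of how a maintenance log might be audited for them. It is an illustration under stated assumptions, not a description of any particular product: the `LogEntry` fields, the `CANONICAL_LABELS` synonym map and the 14-day grouping window are all hypothetical and would need to be chosen by the people who know the operation.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class LogEntry:
    """One row of a hypothetical maintenance log."""
    asset_id: str
    day: date
    label: str   # service action as typed by the technician
    notes: str   # free-form description


# Assumed synonym map: names that refer to the same type of service.
CANONICAL_LABELS = {
    "part replacement": "maintenance",
    "maintenance": "maintenance",
    "system calibration": "system calibration",
}


def find_blank_fields(entries):
    """Return the positions of entries with a blank label or blank notes."""
    return [
        i for i, e in enumerate(entries)
        if not e.label.strip() or not e.notes.strip()
    ]


def normalize_label(label):
    """Map a free-text label onto one canonical name where a synonym is known."""
    key = label.strip().lower()
    return CANONICAL_LABELS.get(key, key)


def group_possible_split_events(entries, window=timedelta(days=14)):
    """Group entries on the same asset that fall within `window` of the
    previous entry. Groups with more than one entry may be a single real
    event that was logged several times and deserve a manual review."""
    groups = []
    last_group = {}  # asset_id -> most recent group for that asset
    for e in sorted(entries, key=lambda e: (e.asset_id, e.day)):
        group = last_group.get(e.asset_id)
        if group is not None and e.day - group[-1].day <= window:
            group.append(e)
        else:
            group = [e]
            groups.append(group)
            last_group[e.asset_id] = group
    return [g for g in groups if len(g) > 1]
```

Note that the sketch only flags candidates. Deciding which entry in a group carries the correct label still requires someone who knows what actually happened, which is exactly what becomes scarce when the events are months or years old.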

Because these events are from months or even years in the past, incomplete records make it very difficult to investigate what actually happened. Ground truth data is a veritable Gordian knot. Untying it to understand the thread of truth it contains is impractical. However, without ground truth data, machine learning systems quickly starve and the operational excellence project that depends on them stalls as well.

We have seen two common responses to data starvation like this:

Brute force: Bring in “forward deployed” data scientists to help untie the knots.
Postpone the pain: Create a data lake to hold “everything,” develop policies to make sense of that data and try to untie the knots at a later time.

Both approaches have drawbacks.

Relying on data scientists does not scale. Data scientists can sometimes help demonstrate that technical risk has been retired, but that very commonly fails to lead to meaningful adoption. As we have related through our observations about increasing production with zero marginal cost predictive analytics and our experience with Agile approaches, it is important to have operational experts lead operational excellence projects.

Creating a data lake defers reckoning. The business practices that led to the inconsistencies and gaps will not go away on their own. Nor will the data in the data lake suddenly become enriched based on changes made a year from now. In the meantime, support for the operational excellence program can fade as technology and money are applied but few actionable insights are gained.

Instead of trying to untie the data knot, what if the problem could be solved by cutting it – by removing the need for historical ground truth data altogether? As we discussed in “The importance of working in the now,” our experience is that there is a way to do this, and that doing it this way, using machine learning with real-time data, is more effective than the alternatives.

Deploy a solution which uses data from current operations: Better data is data that you can explore and enrich today.
Enable the operations experts to directly address problems that are affecting them now.
Use your existing operational teams to scale this approach throughout the operation.

Our approach is embodied in our Time Series AI platform. Cut the knot.