July 21, 2015

From Data Lakes to State Analytics

James Dixon


Earlier this year, I published my thoughts on a data lake use case that I called “The Union of the State.” Since that post was published, data lakes and data analytics have been more widely debated than ever. The more I’ve thought about it, and the more I’ve read and heard from industry influencers such as Dan Woods and Adrian Bridgwater, the more I’ve come to realize that there is an entire branch of analytics that no one is talking about: State Analytics.

State Analytics, obviously, is the analysis of the state of things. It is not the near-term analysis of, and reaction to, changes in state (that’s Real-Time Analytics). Nor is it the analysis of long-term aggregations of state (that’s traditional time-series analysis). It is the analysis of the current state of things or people, optionally with the ability to run the same analysis as of any point in history.

There are existing fields of analytics that fall under the category of State Analytics. Spatial analysis is one of these. Geographic mapping of the location of devices, employees, or customers is a visualization of a state attribute. Many enterprise applications use a state machine under the covers, including ERP, CRM, call center, and case tracking systems. Therefore, some reporting and analysis of these data sets falls under State Analytics.

To do State Analytics, we need a repository that stores the state of the devices, people, or accounts we are interested in. If you work in industries where CRM or ERP systems are popular, this seems obvious, since those applications centrally store the state of every entity. The important word here is “centrally.” In the Internet-of-Things (IoT) field, the “things” (devices such as mobile phones, engines, pumps, cars, etc.) have a lot of state attributes. But in this case, the state of the things is distributed and stored on the things themselves. There is no inherent central store of the state. If the IoT back-end does not provide a central state repository, then State Analytics is not possible.

Predictive Analytics provides a good example of the need for State Analytics in IoT. Consider this question from a business:

“Under what conditions are our Super-Maxi-Squirt pumps likely to fail?”

To answer this question, a data scientist uses a Data Lake of past events and creates a model that predicts the conditions under which the pumps are likely to fail.

“That’s great,” says the business, “You must be pretty clever. Which pumps are likely to fail first?” it continues.

“No idea. Sorry,” replies the data scientist, “We don’t know the current state of every pump. We only look at and store the stream of events from them.”

“Hmm, you’re not so clever after all,” says the business, and it stomps off to sit in a corner and occasionally glares at the data scientist.

In the case of an IoT system, the back-end should include a repository that stores the collection of attributes for each device. As data streams in from the devices, the last known value of the attributes of the devices is updated in the state repository. Since there will be a large volume of incremental updates to individual attribute collections, a NoSQL repository is probably ideal.
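The repository described above can be sketched in a few lines. What follows is a minimal illustration only, not a reference design: it keeps everything in memory where a real back-end would probably use a NoSQL store, and the class `StateRepository`, the attribute names `temp_c` and `vibration_mm_s`, and the scoring rule in `failure_risk` are all hypothetical stand-ins (the scoring rule takes the place of whatever model the data scientist actually trained). It assumes events for each device arrive in time order.

```python
class StateRepository:
    """Hypothetical in-memory stand-in for a central state store."""

    def __init__(self):
        # device_id -> attribute name -> list of (timestamp, value) in arrival order
        self._history = {}

    def apply_event(self, device_id, timestamp, attributes):
        """Record an incremental update; only the attributes in the event change."""
        device = self._history.setdefault(device_id, {})
        for name, value in attributes.items():
            device.setdefault(name, []).append((timestamp, value))

    def device_ids(self):
        return list(self._history)

    def current_state(self, device_id):
        """Last known value of every attribute of one device."""
        device = self._history.get(device_id, {})
        return {name: versions[-1][1] for name, versions in device.items()}

    def state_as_of(self, device_id, timestamp):
        """Last known value of every attribute at or before `timestamp`."""
        state = {}
        for name, versions in self._history.get(device_id, {}).items():
            # Walk backwards to the newest version not later than the requested time.
            for ts, value in reversed(versions):
                if ts <= timestamp:
                    state[name] = value
                    break
        return state


def failure_risk(state):
    """Hypothetical scoring rule standing in for a trained predictive model."""
    overheating = max(0.0, state.get("temp_c", 0.0) - 60.0)
    return 0.1 * state.get("vibration_mm_s", 0.0) + 0.05 * overheating


repo = StateRepository()
repo.apply_event("pump-1", 1, {"temp_c": 55.0, "vibration_mm_s": 2.0})
repo.apply_event("pump-2", 1, {"temp_c": 70.0, "vibration_mm_s": 6.0})
repo.apply_event("pump-1", 2, {"temp_c": 58.0})          # partial update
repo.apply_event("pump-2", 3, {"vibration_mm_s": 9.0})   # partial update

# "Which pumps are likely to fail first?" becomes a query over current state.
ranked = sorted(
    repo.device_ids(),
    key=lambda device_id: failure_risk(repo.current_state(device_id)),
    reverse=True,
)
print(ranked)
```

Because each attribute keeps its earlier values, the same question can also be asked as of a past moment with `state_as_of`, which corresponds to the optional historical view of state. A store that only needs the current state could keep just the latest value per attribute and drop the history.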

It appears that we have been doing State Analytics in some market segments for decades without even knowing it. Since we’ve been doing it for so long already, why even bother identifying the category? Because when we define the category, we can create best practices and templates (or blueprints). We can also look for examples that represent the “state of the art” of State Analytics.

When we look at the category as a whole, we might discover the best examples in one industry and be able to recommend them to others. For example, the best State Analytics today might be found in petroleum processing plants. I’m betting that no one in CRM, ERP, or call centers is looking at chemical plants for examples of how to do better analytics in their industry. We can also look across every industry and market segment for cases where State Analytics is often not done at all, such as the Internet of Things, and recommend the addition of a state repository.


About the author: As CTO at Pentaho, James Dixon is responsible for Pentaho’s architecture and technology roadmap. He is credited with originating the concept of the data lake, and has over 15 years of professional experience in software architecture, development and systems consulting. Prior to Pentaho, James held key technical roles at AppSource Corporation (acquired by Arbor Software which later merged into Hyperion Solutions) and Keyola (acquired by Lawson Software). Earlier in his career, James was a technology consultant working with large and small firms to deliver the benefits of innovative technology in real-world environments.

 
