The world of big data analytics is incredibly diverse, and people are coming up with new analytic tools and techniques every day. But one particularly productive combination that should not be overlooked involves the use of text analytics and machine learning.
Tom Sabo, principal solutions architect at analytics giant SAS, says the one-two punch of predictive modeling on structured data, and text mining with unstructured data, can deliver insights that are more than the sum of their analytic parts.
“They really run side by side,” Sabo tells Datanami. “Let’s say somebody has predictive models in place against whether a customer will churn, or to maximize profit, for instance. If they have text, like notes, in the rest of that structured data…we can incorporate that additional free-form information for actionable insight.”
This is particularly true when using rules-based text analytic routines to extract sentiment from large sources of unstructured text. Once the machine learning algorithms identify customers who are more likely to switch providers, the company can then run text analytics on notes, comments, or other sources of textual data to help answer the most valuable question: Why?
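As a minimal sketch of that workflow, the snippet below takes customers a churn model has already flagged and scores their free-form notes with a tiny rules-based sentiment lexicon. The lexicon, field names, and threshold are all illustrative assumptions, not part of any specific product.

```python
# Hypothetical sketch: after a churn model flags at-risk customers,
# score their free-form notes with a rules-based negative-sentiment
# lexicon to surface *why* they may be leaving.

NEGATIVE_TERMS = {"slow", "rude", "overcharged", "cancel", "frustrated"}

def negative_sentiment_score(note: str) -> int:
    """Count lexicon hits in a free-form note (a very crude sentiment proxy)."""
    words = {w.strip(".,!?").lower() for w in note.split()}
    return len(words & NEGATIVE_TERMS)

customers = [
    {"id": 1, "churn_risk": 0.82, "note": "Support was rude and I was overcharged."},
    {"id": 2, "churn_risk": 0.15, "note": "Great service, thanks!"},
]

# Only examine notes of customers the model already flagged as likely churners.
flagged = [c for c in customers if c["churn_risk"] > 0.5]
for c in flagged:
    c["neg_score"] = negative_sentiment_score(c["note"])
```

In practice the lexicon would be far larger (or replaced by a full rules engine), but the division of labor is the same: the model says *who* is at risk, the text rules suggest *why*.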
The return on this type of analytic investment can be steep, Sabo says, particularly when visualization tools are brought to bear. “You can get results so quickly with that kind of information,” he says. “I love going around and seeing the variety of clients who actually have data in that kind of format and helping them get results quickly.”
Structured Vs. Unstructured Data
What’s interesting about this approach is that it works on both ends of the data structure spectrum. Much of big data analytics is about turning unstructured data into structured data, but progress doesn’t necessarily happen in a straight line.
Consider the data sources for predictive models. In the case of a customer churn model, you’re likely going to feed lots of structured data into the system, including demographic data about the customer, like ZIP code, income, gender, and occupation; the customer’s purchase history, including frequency of purchases, the value of purchases, and date of last purchase; data about the products the customer has bought; and details of any customer service interactions.
Most of this information exists in structured format and can be pulled from the relational database powering the enterprise commerce application. When building the model, the data scientist assigns weights to each variable, and the algorithm generates predictions based on the actual histories of similar customers and on any changes to these details over time.
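A stripped-down version of that weighted approach might look like the following. The feature names and weight values here are invented for illustration; a real model would learn or tune them against historical churn outcomes.

```python
# Illustrative sketch: a simple linear churn score over structured
# fields, with hand-assigned weights (hypothetical names and values).

WEIGHTS = {
    "days_since_last_purchase": 0.02,   # longer gaps raise risk
    "support_tickets": 0.3,             # more complaints raise risk
    "purchases_per_month": -0.5,        # frequent buyers lower risk
}

def churn_score(customer: dict) -> float:
    """Weighted sum over structured features: higher means more likely to churn."""
    return sum(WEIGHTS[f] * customer.get(f, 0.0) for f in WEIGHTS)

c = {"days_since_last_purchase": 90, "support_tickets": 3, "purchases_per_month": 1}
score = churn_score(c)  # 0.02*90 + 0.3*3 - 0.5*1 = 2.2
```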
By contrast, the sources of free-form text for sentiment analysis are not so clear cut. The CRM system may have a place for storing comments from individual customers, but it may not. To supplement potentially sparse free-form customer data from internal sources, a company may look elsewhere for insight, including public social media.
Sabo says social media sites like Facebook and Twitter are good sources for free-form content, while blogs and forums excel for specific domains. “For instance, analyzing how our adversaries might be misusing drone technology,” says Sabo, who has worked with clients in the defense industry. “I can look at a variety of blogs and forums that are related to that and extract a lot of information.”
Text Analytics at CFPB
Sabo recently used this virtuous combination of text analytics and machine learning to explore patterns in data gathered by the Consumer Financial Protection Bureau (CFPB), which was created in the wake of the 2008 mortgage meltdown that triggered the Great Recession.
The data described complaints that people made against banks, credit card companies, and other financial services firms, and included a combination of structured data (like disposition codes), along with unstructured text in the form of comments by the individuals themselves.
[Photo: SAS’s Tom Sabo presenting at the Sentiment Analysis Symposium earlier this year]
Sabo’s analytical approach involved using text analytic techniques to gauge the degree of negative sentiment in the free-form complaint narratives collected by the CFPB. He then used a model to compare that degree of negative sentiment with whether the individual received compensation from the offending bank.
The results indicated that there was, in fact, a correlation between expressions of negative sentiment in the CFPB’s comments and recompense for the aggrieved individuals. In particular, when somebody used the word “steal” (or another word with a similar meaning), there was a greater likelihood of compensation.
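A toy version of this kind of check can be sketched in a few lines: compare the rate of relief among complaints that contain a keyword against those that don't. The complaint records below are fabricated for illustration and bear no relation to actual CFPB data.

```python
# Hypothetical sketch: does the presence of a word like "steal" in a
# complaint correlate with the consumer receiving relief? The records
# here are made up purely to demonstrate the comparison.

complaints = [
    {"text": "They steal fees from my account", "relief": True},
    {"text": "Billing error on my statement",   "relief": False},
    {"text": "This is outright stealing",       "relief": True},
    {"text": "Slow response from the bank",     "relief": False},
]

def relief_rate(items):
    """Fraction of complaints in the group that ended in relief."""
    return sum(c["relief"] for c in items) / len(items) if items else 0.0

has_kw = [c for c in complaints if "steal" in c["text"].lower()]
no_kw  = [c for c in complaints if "steal" not in c["text"].lower()]

# A positive lift suggests the keyword is associated with relief.
lift = relief_rate(has_kw) - relief_rate(no_kw)
```

Real work would of course use a proper statistical test over thousands of records, and synonym expansion rather than raw substring matching.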
This exercise demonstrated the power of the combination of rules-based text analytics and machine learning because, as Sabo noted, “The name of the organization is in free form text. It’s not in the structured data.” (Sabo presented his analysis of the CFPB data at the Sentiment Analysis Symposium.)
The tools and skill sets required to build predictive models and analyze text are not necessarily the same. As a SAS employee, Sabo obviously likes SAS Text Miner, which has been helping companies and government agencies for decades. But there are many other tools out there, too, from companies like IBM, Microsoft, Clarabridge, Lexalytics, KNIME, and RapidMiner, as well as open source options in the form of Apache Mahout, Apache Stanbol, and others.
Once text analytics are introduced to illuminate free-form text, there are many other directions a company can go. For instance, link analytics, which is sometimes associated with graph analytics, may be the next step in helping a company find which words (i.e., which sentiments) are correlated.
“All text analytics does is generate structure where there was no structure,” Sabo tells Datanami. “Once you create that structured data, creating the connections between it becomes the next step.”
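The "create connections" step Sabo describes can be sketched as a simple co-occurrence graph: once each document has been reduced to a set of structured terms, link terms that appear together. The vocabulary and documents below are assumptions for the sake of the example.

```python
# Sketch of a minimal link step: after text is reduced to structured
# terms per document, count which term pairs co-occur. Term extraction
# here is just intersection with a tiny illustrative vocabulary.

from itertools import combinations
from collections import Counter

VOCAB = {"fees", "overdraft", "steal", "loan"}

docs = [
    "hidden fees on my overdraft",
    "they steal overdraft fees",
    "loan terms changed",
]

edges = Counter()
for doc in docs:
    terms = sorted(VOCAB & set(doc.split()))   # structured terms for this doc
    for a, b in combinations(terms, 2):        # every co-occurring pair
        edges[(a, b)] += 1
```

The resulting edge counts are exactly the kind of structured connections a graph or link-analysis tool would then visualize and mine.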
Taking the next step into link analysis may require deeper expertise still, including knowing how to program rules for natural language processing (NLP) engines.
No matter where you set your big data analytics flag, it’s good to know where others have gone before you. Data sources, both internal and external, abound today, and the analytical options can be intimidating. But if you’re building predictive models with machine learning technology, it’s worth knowing that the addition of text analytics could be a force multiplier as you set out to leverage big data.