How Hadoop is Remaking Travel and Expense Reporting at Concur
If you’re like most people, filling out an expense report ranks right up there with getting a haircut or visiting the dentist. But thanks to the advanced analytics work that Concur is doing with Cloudera and Hadoop, the expense report process is becoming not only more enjoyable for business travelers, but more helpful too.
Concur started working with Hadoop about three years ago as a place to combine and “munge” all the disparate data sources that the company deals with. With more than 20,000 clients serving 25 million employees around the world, data volumes are certainly big. But the sheer variety of the travel and expense-related data (every coffee-smeared IHOP receipt and crumpled airline invoice) is a much bigger issue, and the source of the company’s initial foray into the world of Hadoop.
Having all the data in one location allows Concur to do things that would have been very difficult using relational technology, according to Denny Lee, the senior director who heads up the data analytics team at the Bellevue, Washington, company.
“We’re able to answer questions and provide a degree of personalization and customization that we were not able to do before,” Lee tells Datanami. “We have the ability to merge a lot of data from a lot of disparate sources with a lot of different formats in a way that allows us to query it and make sense of that data.”
Like many companies, Concur is currently transitioning its Hadoop cluster from the early proof-of-concept phase into a bigger production system that the company will depend on to harness big data. As part of that transition, it recently hired Lee, who was part of the nine-person team that implemented the first Hadoop cluster running at Microsoft. It’s also moving up from free Hadoop software to Cloudera’s Distribution of Hadoop (CDH), which gives it technical support and the confidence to grow the cluster.
The company’s 40-node Hadoop cluster currently has about 60TB of data—not huge by any means. But that system is expected to grow rapidly as new prediction, personalization, and recommendation systems come online. It will likely be in the petabyte range within a few short years, Lee says.
Hadoop is a central component to Concur’s plans to improve the travel and expense reporting process and experience for its clients. While Concur is best known for its hosted expense reporting service, it also provides travel booking services, a la Expedia and Kayak. Since travel is typically the biggest line item on expense reports, it made sense to combine them, which Concur did with a 2005 acquisition that got it into the travel business.
On the expense reporting side of the house, Hadoop will provide Concur with two main advantages: assisting with the classification of expense report items, and providing personalized recommendations to business travelers.
“We will make the mundane experience of filling out your expenses and expense reports easier,” Lee says. “We have a feature called ‘Expenseit’ where you take a picture of a receipt, and it will go ahead and auto-fill the expense report for you, as opposed to you typing all that stuff in.”
Hadoop is not yet powering that level of automatic classification; that feature is still in development. But in the future, Lee expects to leverage the power of machine learning algorithms for large-scale data classification work. Part of the challenge in implementing such a system is that every company has its own expense reporting policies. A Starbucks coffee will be classified as a beverage at one company, while it will be dropped into the meal bucket at another.
“That’s where the Hadoop cluster comes in, because I have enough data not just from you, but also your company and other companies that have similar enough expense types, purely in a non-identifiable way,” Lee says. “From a data science perspective, we’re trying to apply machine learning algorithms to basically figure that stuff out, to see if we can be a heck of a lot more precise and more flexible than what a rules engine would apply. Especially with the number of customers and variations by industry, machine learning seems to be a much more accurate way to solve this problem than a rules engine.”
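To make the contrast with a fixed rules engine concrete, here is a minimal Python sketch of a classifier that learns categories from past expense reports. It is not Concur’s implementation: the `ExpenseClassifier` class, the company names, and the sample history are all hypothetical, and a simple majority vote stands in for the machine learning algorithms Lee describes. The idea it illustrates is the one in the quote: use a company’s own history first, and fall back to pooled data from other companies when that history is missing.

```python
from collections import Counter, defaultdict


class ExpenseClassifier:
    """Toy majority-vote classifier (hypothetical; not Concur's system)."""

    def __init__(self):
        # (company, merchant) -> counts of categories seen at that company
        self.by_company = defaultdict(Counter)
        # merchant -> counts of categories pooled across all companies
        self.by_merchant = defaultdict(Counter)

    def fit(self, history):
        """history: iterable of (company, merchant, category) tuples."""
        for company, merchant, category in history:
            self.by_company[(company, merchant)][category] += 1
            self.by_merchant[merchant][category] += 1
        return self

    def predict(self, company, merchant, default="uncategorized"):
        # Prefer the company's own history, then fall back to pooled data.
        candidates = (
            self.by_company.get((company, merchant)),
            self.by_merchant.get(merchant),
        )
        for counts in candidates:
            if counts:
                return counts.most_common(1)[0][0]
        return default


# Made-up history: the same merchant is categorized differently per company.
history = [
    ("acme", "Starbucks", "beverage"),
    ("acme", "Starbucks", "beverage"),
    ("globex", "Starbucks", "meal"),
    ("globex", "Starbucks", "meal"),
    ("globex", "Starbucks", "meal"),
]
model = ExpenseClassifier().fit(history)

print(model.predict("acme", "Starbucks"))     # beverage (acme's own history)
print(model.predict("globex", "Starbucks"))   # meal (globex's own history)
print(model.predict("initech", "Starbucks"))  # meal (pooled: 3 meal vs 2 beverage)
```

A rules engine would need one hand-written rule per company and merchant; here the same Starbucks receipt lands in different buckets purely because the historical data differs.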
The company is also using its Hadoop cluster to power personalized recommendation systems for its expense report and travel booking services. This system will enable the company to recommend certain flights, hotels, or restaurants to customers. The recommendations will be based on a variety of circumstances, including the customer preference models that are created in CDH and pushed out to an externally facing web application powered by a Couchbase NoSQL database.
“It will help from the standpoint of improving recommendations for your travel, whether it’s shorter or faster, or booking a hotel that’s closer to where you want to be,” Lee says. “If you go to a new city, we can say, ‘Here are the restaurants you would prefer based on your preferred cuisine types or the expenses that you were reimbursed.’”
Concur’s Hadoop cluster is currently generating personalized travel and hotel recommendations for clients via the Couchbase-powered website. For a given flight, instead of returning 1,000 potential options, Hadoop will weed out the ones that don’t fit, and push flights that best match the customer’s profile to the top of the list. Lee’s group is currently working on the restaurant recommendation system, which has not yet been rolled out.
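The filter-then-rank behavior described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Flight` and `Profile` fields, the scoring weights, and the sample flights are invented, and the article does not say how Concur’s preference models actually score options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    code: str
    airline: str
    stops: int
    depart_hour: int  # 0-23, local time


@dataclass(frozen=True)
class Profile:
    preferred_airlines: frozenset
    max_stops: int
    earliest_hour: int
    latest_hour: int


def score(flight, profile):
    """Higher is better. The weights are arbitrary illustrative values."""
    total = 0.0
    if flight.airline in profile.preferred_airlines:
        total += 2.0
    if profile.earliest_hour <= flight.depart_hour <= profile.latest_hour:
        total += 1.0
    total -= 0.5 * flight.stops
    return total


def recommend(flights, profile, top_n=3):
    # Weed out options that do not fit, then push the best matches to the top.
    eligible = [f for f in flights if f.stops <= profile.max_stops]
    ranked = sorted(eligible, key=lambda f: score(f, profile), reverse=True)
    return ranked[:top_n]


flights = [
    Flight("A1", "Alaska", 0, 9),
    Flight("D7", "Delta", 1, 10),
    Flight("U3", "United", 2, 8),
    Flight("A9", "Alaska", 1, 22),
]
profile = Profile(frozenset({"Alaska"}), max_stops=1, earliest_hour=7, latest_hour=18)

print([f.code for f in recommend(flights, profile)])  # ['A1', 'A9', 'D7']
```

The two-stop flight is dropped outright, and the remaining three are ordered by how well they match the traveler’s stated preferences.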
There are more things Concur can do with optimizing the travel recommendations, especially as it pertains to the weather. “If we happen to know that an airport has more delays due to weather disruptions, we can recommend that you fly on a different day,” Lee says. “It’s not something we’ve done yet, but it’s on our radar, because that’s what we’re trying to do: take all this data to provide better recommendations to make your business travel easier.”
Lee and his group at Concur are big users of R, Python, and Scala. Lee, who hosts the Seattle area Spark meet-up, is a big fan of the in-memory technology and is eager to incorporate its various components, particularly the machine learning and graph analytics libraries, to provide more real-time data processing at Concur.
Concur is uniquely positioned to leverage the large amounts of travel and expense-related data that its customers generate, in pursuit of improving the services it provides to those same customers. Without Hadoop, it would be difficult to act on all that data without getting bogged down by the technical details.
“We have enough smart people in this organization that we could probably figure this out,” Lee says. “But I don’t want to. It’s not worth the effort. The ease for us to be able to do this with CDH allows us to focus on the data science … as opposed to worrying about the infrastructure aspect or the joining aspect or the getting-different-systems-to-talk-to-each-other aspect.
“And ultimately, of course, the whole reason we started down this path was to realize not the technological value, but the immense business value Hadoop was going to bring to us,” he continues. “Because we’re already solving problems that are a lot easier to solve because we had Hadoop.”