October 11, 2012

Expedia Adds Notes to Big Data Symphony

Nicole Hemsoth

These days, when we use an online travel booking service to bundle a flight, hotel and car into a packaged deal, many of us don’t stop to think about the complexity of the services that back our purchases.

It’s simple to overlook the symphony of fast-changing pricing models, flight information and hotel details, matched immediately to our location, that hums along to provide real-time results for even the most complicated bundled journey request.

And that’s just on the backend. On the customer-facing side, another masterpiece is being conducted based on geographic data from users and their locations, previous travel histories, customer scores, travel habits, social media details and a host of other (sometimes creepy) targeting data that is fed into the travel site experience.

If you’re in the business of large-scale web services, maybe this does cross your mind as you make your purchase, but the point here is that the technology abstraction, personalization, and real-time responses behind our queries mask untold layers of complexity. That complexity gets more tangled as older services evolve, shedding and gaining systems and frameworks along the way.

This week we were on site in Las Vegas for the SAS Premier Business Leadership Summit, which brought together hundreds of folks in that camp most likely to spend a moment marveling at the technology experience of using such a booking service. Among that set was Joe Megibow, VP and General Manager at Expedia, who told me that when we’re talking about $11 billion in yearly bookings and 45 million web visitors per month across infrastructure that’s evolved over the course of 16 years, there are some rather unique, tough data problems to solve.

This symphony hits a crescendo when you consider that beyond the Expedia site (which is the U.S. operation), the company operates in 22 other countries while integrating global travel information systems to conduct a concert of worldwide travel options delivered in real-time.

At a high level, Expedia relies on a large number of systems and software approaches, combining everything from unique big data systems to traditional warehousing to a wide variety of statistical modeling and business intelligence tools. Megibow says it’s imperative that they “do data right,” as it’s the pure bread and butter of their business on both the booking engine side and the user end. As one might imagine, however, a 16-year process of data evolution from what started within the confines of a software business (Microsoft) is its own story.

“We’re an old school internet site and our platform reflects that since we have a lot of stuff we’ve been carrying around for many years that you don’t see much anymore,” Megibow told us, noting that while they may have some leftovers in their data fridge, they’re doing some cutting-edge work that is pushing them past their competition in terms of performance, user experience, and even data management on the backend.

With continuous evolution of the data environment comes a constant cycle of tweaking and integration as the old frameworks are forced to fit in new boxes. Add to this an active test and development environment for trying out new technologies to boost big data management, ingestion and analysis, and what travel booking vendors are left with is a mismatched set of systems that need to be fine-tuned to operate in harmony, or migrated off older environments completely.

Megibow says that as the space has continued to evolve, they’ve moved from a 100% Microsoft shop to a Java-based environment. Additionally, the company deploys a large, scattered vendor environment that he says taps best-of-breed tools for solving specific problems that they weren’t able to refine to production quality in house.

To put this in perspective, Megibow gave us a broad sense of the far-reaching vendor ecosystem they look to for powering all of this complexity, noting that they use traditional SAS for much of their statistical modeling, which is a core part of their business and is powered by some PhD data scientists.

Beyond model development and building production models, Expedia uses DB2 for data warehousing and uses other purpose-driven products for specific tasks, including SAP Business Objects, Tableau, Tealeaf (Megibow was on the founding team here), QlikView, Adobe Omniture, Google Analytics, and far more that he didn’t name. He says that while they were originally built on a Microsoft technology base, their evolution has severed many of the deep technological roots in those original tools and approaches. Hadoop is often a solid fit for the web giants whose sole purpose is in counting data, and Expedia is no exception.

According to the Expedia VP, right now they have an (approximately) 200-node Hadoop cluster, the use of which he expects will grow by 50% to 100% per year, on target with data volume growth. They’ve been experimenting with the open source distribution of Hadoop, but he says that as they continue to grow, they will keep looking to the distribution vendors (Hortonworks, MapR, Cloudera, etc.) and possibly make a shift to a supported version.

“We found that a lot of people were loading the data into their data warehouse then used Hadoop as an afterthought—we bucked that trend and started using it as the front end of our ETL so now almost everything we get goes through Hadoop.” While clearly it’s not used for some of the real-time operations for customer booking, it’s at the core of the feedback loops and other user experience and engagement processes, which serve as the heart of their continued business.

The Hadoop cluster is handling real-time streams, not processing, notes Megibow. “If you’re shopping for Vegas hotels, but don’t buy, you’ll get an email later that day and find that your web experience points you back to us. There are feedback loops galore,” many of which are powered by the expanding Hadoop system.

Feedback loops and the many optimizations that they’ve been able to discover through a constant focus on analytics have allowed them to push into new realms of operational efficiency. For instance, extensive understanding of what makes a user give up during registration can be as simple as finding that putting the “last name” box beside (versus under) the “first name” led to an increase of $1 million in business since users weren’t forced to keep revisiting the registration page to “get it right.”

The optimization side, using analytics and models to understand behavior, preferences and experience, is one division, but the impressive part of this business lies in the real-time churning of constantly changing flight and hotel information in the context of the greater optimization and user experience.

On the flight booking side, Megibow says that the engines that handle that traffic flow have been purpose-built and have been with them since the beginning. There are really only two major air pricing engines: ITA, which Google bought, and Expedia’s custom-built solution that ties in services from the world’s largest travel booking information sources (Amadeus and Sabre are two components). These engines pull together streams of airline pricing and feeds using a complex algorithm that allows Expedia’s competitors to tap into the same datasets.

For this air booking piece, he says their own algorithms have been cooking for some time and they’ve now reached a point where they’ve consolidated the number of servers chewing on these tasks following a move to a 1000-node commodity cluster. At this point, he says, it’s down to the software to find real-time picks of the best 100 or 200 flights where inventory and pricing haven’t already changed during the query.
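The selection step described above can be pictured with a minimal sketch. It is not Expedia's engine: the `FareQuote` record, the `version` counter used to detect a fare that changed mid-query, and the `best_flights` helper are all hypothetical names invented for illustration. The idea is simply to drop quotes that are sold out or stale, then keep the cheapest few.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class FareQuote:
    flight_id: str
    price: float
    seats_left: int
    version: int  # hypothetical counter, bumped whenever the fare changes


def best_flights(quotes, current_versions, limit=100):
    """Return the `limit` cheapest quotes that are still bookable.

    A quote is kept only if it has seats left and its version matches
    the latest known version for that flight (an assumed staleness check).
    """
    fresh = [
        q for q in quotes
        if q.seats_left > 0 and current_versions.get(q.flight_id) == q.version
    ]
    return heapq.nsmallest(limit, fresh, key=lambda q: q.price)


quotes = [
    FareQuote("AA10", 320.0, 4, 1),
    FareQuote("UA22", 280.0, 0, 1),  # sold out
    FareQuote("DL35", 299.0, 2, 1),  # repriced while the query was running
    FareQuote("WN48", 310.0, 9, 3),
]
current_versions = {"AA10": 1, "UA22": 1, "DL35": 2, "WN48": 3}

picks = best_flights(quotes, current_versions, limit=2)
print([q.flight_id for q in picks])  # ['WN48', 'AA10']
```

The two cheapest quotes are discarded even though they would otherwise win on price, which is the point of checking inventory and pricing before returning results.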

ITA and Expedia’s air pricing engines are only one small piece of the mega-booking site’s operations. The hotel booking processes run on their own cluster with its own pricing and modeling engine. Within this layer are new sets of complexities that require some mind-boggling math. For example, he points to their hotel recommendation process, which shows users hotels near them if they’re on mobile devices—and tries to set them up with the ideal location if they’re planning in advance. There is an entire GIS piece to this puzzle that takes some complex operations.

Consider this example—you’re traveling to a big data conference in Florida next month and booked (the old fashioned way) a hotel from Google’s distance results. Before booking, you Google hotels near the center to see what’s nearby without looking at the map view, instead focusing on hotel names and proximity. Self-satisfied, you book the one that Google’s text results tell you is one mile away. Great. Well, unfortunately, little did you know, that hotel truly is just a mile from the convention center, but only if you cross the channel, which means you’re forced to either take a ferry that only comes once per hour or take a cab that winds around the entire bottom of the peninsula, making one mile into about fifteen.

This happens, and it happened to people at the beginning of online travel booking, too. That is, until services like Expedia built complex systems that “score” individual hotels based on access and proximity to the desired location, and that understand whether something is actually on the beach (or has a beach view from across the highway). Megibow says that no one should underestimate what a difficult process this is, especially as established users with defined travel patterns the company understands book travel based on what Expedia knows they look for. In other words, in real-time, the results of this complicated hotel scoring, updated price information, location details, and a bevy of other data are operating in harmony to deliver instant, tailored results.
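To make the convention-center example concrete, here is a small sketch of why scoring on access differs from ranking on straight-line distance. Everything in it is an assumption for illustration: the `Hotel` fields, the made-up hotel names and distances, and the weights in `score` are not Expedia's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotel:
    name: str
    straight_line_miles: float  # distance as the crow flies
    travel_miles: float         # hypothetical: distance by road or ferry
    on_beach: bool


def score(hotel, wants_beach=False):
    """Higher is better. The weights are illustrative, not tuned."""
    access = 1.0 / (1.0 + hotel.travel_miles)
    beach_bonus = 0.25 if wants_beach and hotel.on_beach else 0.0
    return access + beach_bonus


hotels = [
    Hotel("Across the Channel Inn", 1.0, 15.0, True),
    Hotel("Downtown Suites", 2.5, 3.0, False),
]

# Naive ranking: closest on the map wins.
by_straight_line = min(hotels, key=lambda h: h.straight_line_miles)

# Access-aware ranking: the hotel you can actually reach wins.
by_score = max(hotels, key=score)

print(by_straight_line.name)  # Across the Channel Inn
print(by_score.name)          # Downtown Suites
```

A real system would fold in many more signals (price, user history, view, reviews), but even this toy version shows how the one-mile hotel loses once travel distance is measured instead of map distance.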

Consider all of the airline and hotel booking processes that work in unison for a short business trip, but add in a car to make it a travel package with an emphasis on the best price on all three services. “There’s a ton of data behind,” he tells us, and a lot of complexity that is forcing them outside of the box when it comes to looking at new data management, processing and even storage capabilities.

Online travel booking is certainly nothing new, but the competition is finding some interesting approaches that leverage some of the same tools in unique ways. As a completely data-driven business, there is no downtime allowed—evolving has to happen on-the-fly and in concert with a disparate set of systems and processes that must harmonize in near real-time, or at least be able to reel a wayward customer back.