Selecting the Right Database for the Right Job
The Internet and mobile devices have created tremendous opportunities to engage with consumers in new ways. At the same time, they have brought unprecedented challenges for managing the vast volumes of structured and unstructured data being pumped in from these channels.
At Adform, we face those opportunities and challenges first-hand. We enable media agencies, trading desks, advertisers and publishers to easily scale and deliver programmatic display advertising, rich media, and video across desktop and mobile devices. Offers and other content are customized on the fly for different consumers based on their history, which Web pages they have just visited and what they are clicking on now.
If this sounds like too tall an order for one data management system, it is. In our experience, different data management systems are tuned for different demands. There are obvious distinctions between back-office analysis provided by data warehouses and Apache Hadoop, SQL relational databases for structured data, and NoSQL for managing both structured and unstructured data. However, we have found that even different NoSQL databases are better aligned with different functions within the organization. With that in mind, the following is an overview of the data management systems that we currently have deployed at Adform, as well as the roles they play.
Backend Data Analysis
We rely on an open-source Apache Hadoop system to analyze the vast amounts of data that come in from Internet and mobile interactions. Before being ingested into Hadoop, data is validated, enriched and attributed by a distributed ETL system. Because Hadoop places no restrictions on the data it stores, it is ideally suited to handle this information, which is a wide mix of structured and unstructured data, much of it schema-less. Additionally, the distributed computational capabilities of Hadoop provide an elegant way of handling the 300 terabytes of data we store in the system. We use this data for exploratory analytics and for offline raw data exports used by our clients. We also combine analysis from our Hadoop system with real-time data in Aerospike to create the most complete picture of the consumer possible. To speed up requests to the data warehouse we use HP Vertica, which provides a robust resource for SQL analysis of transactional data and for reporting to customers.
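The validate, enrich and attribute steps that run before ingestion can be illustrated with a minimal single-process sketch. Everything in it is an assumption made for illustration: the event fields, the `COUNTRY_BY_IP_PREFIX` lookup table, and the choice of last-click attribution are hypothetical and are not a description of Adform's distributed ETL system.

```python
from typing import Dict, List

# Hypothetical enrichment table (assumption, not from the article).
COUNTRY_BY_IP_PREFIX = {"10.": "DK", "192.": "GB"}


def validate(event: Dict) -> bool:
    """Accept only events that carry the fields the later stages rely on."""
    return all(event.get(k) is not None for k in ("user_id", "type", "ts"))


def enrich(event: Dict) -> Dict:
    """Return a copy of the event with a derived 'country' field."""
    ip = event.get("ip", "")
    country = next(
        (c for prefix, c in COUNTRY_BY_IP_PREFIX.items() if ip.startswith(prefix)),
        "unknown",
    )
    return {**event, "country": country}


def attribute(events: List[Dict]) -> List[Dict]:
    """Last-click attribution (an assumed model): tag each conversion
    with the campaign of the same user's most recent earlier click."""
    last_click = {}  # user_id -> campaign of the latest click seen so far
    out = []
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["type"] == "click":
            last_click[e["user_id"]] = e.get("campaign")
        elif e["type"] == "conversion":
            e = {**e, "attributed_campaign": last_click.get(e["user_id"])}
        out.append(e)
    return out


def run_etl(raw_events: List[Dict]) -> List[Dict]:
    """Validate, then enrich, then attribute; invalid events are dropped."""
    cleaned = [enrich(e) for e in raw_events if validate(e)]
    return attribute(cleaned)
```

In a distributed setting each stage would typically run in parallel over partitions of the event stream; the sketch only shows the order of the stages and what each one contributes.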
Interactive, Real-time Data
Aerospike, an open-source in-memory NoSQL database, sits right behind our Web servers and serves as the main underlying repository for our real-time trading engine, as well as our dynamic creative engine. We rely on Aerospike’s fast key-value store (KVS) capabilities for the user profile store supporting our data management platform (DMP), ID mapping, and dynamic creative optimization. We also take advantage of Aerospike’s combination of indexes in RAM and data in direct-attached SSDs for near in-memory speed at flash storage prices. The result has been stable and predictable low latency in a system that is easily scalable.
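The idea of keeping indexes in RAM while the data lives on flash can be shown with a toy key-value store. This is only a sketch of the general technique: it does not use the Aerospike client or reflect Aerospike's internals, the class name `HybridKVStore` is hypothetical, and an ordinary file stands in for the SSD.

```python
import json
import os
import tempfile


class HybridKVStore:
    """Toy key-value store: the index is a dict in memory, the records
    are appended to a file that stands in for direct-attached SSD."""

    def __init__(self, path):
        self._index = {}                 # key -> (offset, length), held in RAM
        self._file = open(path, "w+b")   # record data, held on "flash"

    def put(self, key, record):
        payload = json.dumps(record).encode("utf-8")
        self._file.seek(0, os.SEEK_END)  # always append; old versions are left behind
        offset = self._file.tell()
        self._file.write(payload)
        self._file.flush()
        self._index[key] = (offset, len(payload))

    def get(self, key):
        location = self._index.get(key)  # in-memory lookup, no disk access
        if location is None:
            return None
        offset, length = location
        self._file.seek(offset)          # a single read at a known offset
        return json.loads(self._file.read(length).decode("utf-8"))

    def close(self):
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        store = HybridKVStore(os.path.join(directory, "profiles.dat"))
        store.put("user:42", {"segments": ["sports"], "visits": 3})
        print(store.get("user:42"))
        store.close()
```

A lookup costs one dictionary access plus one read at a known offset, which is why this kind of layout can stay close to in-memory latency while the bulk of the data sits on cheaper storage. A production system would also need space reclamation for overwritten records, replication and crash recovery, all of which the sketch leaves out.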
Our real-time trading cluster now processes a few hundred thousand requests per second, a majority of which are auction-based requests where Aerospike supports the calculation of bids across several thousand bidding strategies and campaigns. Meanwhile, our creative engine uses much of the same data to determine which product or creative design is best to display to each individual user. All user interactions with online ads are stored in Aerospike. This data, combined with additional data from a large Hadoop cluster, is then used to make bidding decisions in real time.
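As a rough illustration of that data flow, the sketch below combines a real-time user profile with scores produced offline and evaluates a list of strategies for one auction request. The function `choose_bid`, the field names and the scoring formula are all hypothetical; they show the shape of the lookup-and-decide step, not Adform's actual bidding logic.

```python
from typing import Dict, List, Optional, Tuple


def choose_bid(
    profile: Dict,
    offline_scores: Dict[str, float],
    strategies: List[Dict],
) -> Optional[Tuple[str, float]]:
    """Return (strategy_id, bid) for the highest bid among the strategies
    that target one of the user's segments, or None if none match.

    profile        -- real-time record for the user (assumed shape)
    offline_scores -- per-segment scores from batch analysis (assumed)
    strategies     -- dicts with id, segment, base_bid, max_bid (assumed)
    """
    user_segments = set(profile.get("segments", []))
    best = None
    for strategy in strategies:
        if strategy["segment"] not in user_segments:
            continue  # this strategy does not target the user
        score = offline_scores.get(strategy["segment"], 0.0)
        # Scale the base bid by the offline score, capped at the strategy limit.
        bid = min(strategy["base_bid"] * (1.0 + score), strategy["max_bid"])
        if best is None or bid > best[1]:
            best = (strategy["id"], bid)
    return best
```

With thousands of strategies and a latency budget of about a millisecond for the data lookups, the profile read has to be a single fast key-value fetch, which is the role the real-time store plays in this flow.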
We maintain an Aerospike cluster in each of our two data centers, which are connected via an optical line. Running on bare metal, each Aerospike cluster now holds 96GB of data in DRAM along with 1.7TB of data on Intel SSDs. Overall, Aerospike handles more than 200,000 operations per second with a latency of less than 1 millisecond.
Because we work with a variety of data in many different contexts and with many different access patterns (relational data, high read/write volumes, metadata and so on), we use different databases for the applications and analytics that drive our business.
Today, the most widely used databases within our organization are Microsoft SQL Server, Cassandra and MongoDB. This variety of technologies fits our mindset: use the right tool for the right task.
Success in Diversity
In “On the Origin of Species,” Charles Darwin predicted that greater diversity within an environment would lead to greater productivity. Since then, scientific experiments have supported that prediction.
We have seen a parallel in our own IT environment, where we use a number of different databases that we find useful for different purposes. By applying the strengths of each data management system to the various applications within our organization, we have created an environment that has supported our ability to double in size in each of the last few years. More importantly, it positions us to continue expanding and evolving rapidly as we move forward.
About the author: Jakob Bak is a co-founder of Adform and one of the leading technology architects of the Adform platform. Prior to co-founding Adform, Jakob worked as a management consultant at The Boston Consulting Group and as a research assistant at the Technical University of Denmark. Jakob holds an MSc in Engineering from the Technical University of Denmark, and also attended a few semesters at USM in Malaysia. He especially enjoys travelling, which takes him to Asia and other continents several times a year. Jakob works from the Adform London office.