Follow Datanami:
September 1, 2014

Five Steps to Running ETL on Hadoop for Web Companies

Shanti Subramanyam

Mention ETL (Extract, Transform and Load) and eyes glaze over. The thought goes: “That stuff is old and meant for clunky enterprise data warehouses. What does it have to do with my Internet/Web/ecommerce application?”

Quite a lot, actually. ETL did originate in enterprise IT where data from online databases is Extracted, then Transformed to normalize it and finally Loaded into enterprise data warehouses for analysis. Although Internet companies feel they have no use for expensive, proprietary data warehouses, the fact of the matter is that ETL is still a requirement and so is some kind of a data warehouse. The logic is simple: one doesn’t run business reports on the database powering the online application.

An ETL Example

Consider the classic example of key transformation. The application database uses a customer_id to index into the customer table, while the CRM system has the same customer referenced differently. The business analyst wants to analyze how customers are using the product and thus, the data warehouse needs a distinct way to refer to customers i.e. the keys need to be transformed and mapped to a new key in the DW. Even if there is a single source system, it is still a good idea to do such transformations to isolate the warehouse from the online database. In addition to such basic transformations, data is also often enriched (as for example using geocodes) to create the target customer record in the warehouse. There is no getting away from it: ETL is a requirement whether you are the hottest social media company or a 50-year-old bank.

Why Hadoop?

All right. We need ETL. But what has Hadoop got do with it?

Turns out that Hadoop is an ideal platform to run ETL. You can feed the results into a traditional data warehouse, or better yet, simply use Hadoop itself as your warehouse. Two for the price of one! And ingesting data from all sources into a centralized Hadoop repository is future proof: as your business scales and the data grows rapidly, the Hadoop infrastructure can scale easily.

The Hadoop platform has tools that can extract the data from the source systems, whether they are log files, machine data or online databases and load them to Hadoop in record time. It is possible to do transformations on the fly as well, although more elaborate processing is better done after the data is loaded into Hadoop. Programming and scripting frameworks allow complex ETL jobs to be deployed and executed in a distributed manner. Rapid improvements in interactive SQL tools make Hadoop an ideal choice for a low cost data warehouse.

Got it. What needs to be done to get this all to work? Read on to find out.

ETL Process in Hadoop

An architecture for setting up a Hadoop data store for ETL is shown below.

5 steps for ETL Hadoop

Here are the typical steps to setup Hadoop for ETL:

  1. Set up a Hadoop cluster,
  2. Connect data sources,
  3. Define the metadata,
  4. Create the ETL jobs,
  5. Create the workflow.

 Set Up a Hadoop Cluster

This step can be really simple or quite difficult depending on where you want the cluster to be. On the public cloud, you can create a Hadoop cluster with just a few clicks using Amazon EMR, Rackspace CBD, or other cloud Hadoop offerings. If the data sources are already on the same public cloud, then this is obviously the no-brainer solution.

If however your data sources happen to be in an in-house data center, there are a couple of things to take into consideration:

  1. Can the data be moved to the cloud? Legal, security, privacy and cost considerations apply.
  2. Can test data be used for development?

If the answer is No to both questions, then a cluster will need to be provisioned in the data center. Go befriend your IT/OPS guy right away.

Connect Data Sources

The Hadoop eco-system includes several technologies such as Apache Flume and Apache Sqoop to connect various data sources such as log files, machine data and RDBMS. Depending on the amount of data and the rate of new data generation, a data ingestion architecture and topology must be planned. Start small and iterate just like any other development project. The goal is to move the data into Hadoop at a frequency that meets analytics requirements.

Define the Metadata

Hadoop is a “schema-on-read” platform and there is no need to create a schema before loading data as databases typically require. That does not mean one can throw in any kind of data and expect some magic to happen. It is still important to clearly define the semantics and structure of data (the “metadata”) that will be used for analytics purposes. This definition will then help in the next step of data transformation.

Going back to our example of the customer ID, define how exactly this ID will be stored in the warehouse. Is it a 10 digit numeric key that will be generated by some algorithm or is it simply appending a four digit sequence number to an existing ID?

Many Hadoop projects are begun without any clear definition of metadata. Just like ETL, the term “Metadata Management” is considered old school and meant for traditional enterprise IT, not for our modern data architecture. But in reality, metadata is crucial for the success of Hadoop as a data warehouse. With a clear design and documentation, there is no ambiguity in what a particular field means or how it was generated. Investing up front in getting this right will save a lot of angst later on.

With the metadata defined, this can be easily transposed to Hadoop using Apache HCatalog, a technology provides a relational table view of data in Hadoop. HCatalog also allows this view to be shared by different type of ETL jobs, Pig, Hive, or MapReduce.

Create the ETL Jobs

We can finally focus on the process of transforming the various sources of data. Here again, multiple technologies exist: MapReduce, Cascading and Pig are some of the most common used frameworks for developing ETL jobs. Which technology to use and how to create the jobs really depends on the data set and what transformations are needed. Many organizations use a combination of Pig and MapReduce while others use Cascading exclusively. Learn about all the different ways transform jobs are done and the strengths and weaknesses of the various technologies.

A word of caution – engineers experienced in enterprise data management may be prone to aggressive data cleansing and transformation. They want order and the data to confirm to pre-defined schemas. However, the whole notion of big data is that it can be unstructured. Machine and sensor data are likely to be noisy, and social media and other data may not fit into neat buckets. Too much cleansing can get rid of the very insights that big data promises. A thoughtful approach is required to get the most value from your data.

Create the Workflow

Data cleansing and transformations are easier done when multiple jobs cascade into a workflow, each performing a specific task. Often data mappings/transformations need to be executed in a specific order and/or there may be dependencies to check. These dependencies and sequences are captured in workflows – parallel flows allow parallel execution that can speed up the ETL process. Finally the entire workflow needs to be scheduled. They may have to run weekly, nightly, or perhaps even hourly.

Although technologies such as Oozie provide some workflow management, it is typically insufficient. Many organizations create their own workflow management tools. This can be a complex process as it is important to take care of failure scenarios and restart the workflow appropriately.

A smooth workflow will result in the source data being ingested and transformed based on the metadata definition and stored in Hadoop. At this point, the data is ready for analysis. And you guessed it! There are many different ways to do that with Hadoop; Hive, Impala and Lingual provide SQL-on-Hadoop functionality while several commercial BI tools can connect to Hadoop to explore the data visually and generate reports.


We are finally done! We have created a data warehouse in Hadoop. Although this seems complicated (depending on the data and requirements), almost all of the technologies are open-source and available for free. Tools are now emerging that help automate some part of this process. If your organization does not have the expertise, it may be a good idea to engage outside services to get started on this new architecture and technologies, while hiring/training your own staff.

Data warehouses are a requirement even for Web/Internet companies. Data cleansing, data transformation, ETL, metadata are all terms that are still relevant for new data architectures. But they don’t need to be created using proprietary, expensive products. Leveraging big data technologies such as Hadoop will ensure your data architecture stands the test of time (at least until the next big wave!)


About the author: Shanti Subramanyam is founder and CEO at Orzota, a big data services company whose mission is to make Big Data easy for consumption. Shanti is a Silicon Valley veteran and a performance and benchmarking expert. She is a technical leader with many years of experience in distributed systems and their performance acquired at companies like Sun Microsystems and Yahoo! Contact Shanti via twitter @shantiS , @orzota, or


Related Items:

What Exactly Is Big Data, If It’s Neither About Big Nor Data?

Why Hadoop Isn’t the Big Data Solution You Think It Is

Five Steps to Demystify Big Data Analytics