The Top Three Challenges of Moving Data to the Cloud
Most data-driven businesses have already moved, or are looking to move, their data from on-premises databases to the cloud to take advantage of its unlimited, on-demand storage and compute. Implementing cloud warehouses and analytics/BI platforms lets businesses connect disparate data silos in real time for increased agility, better decision-making, and a competitive edge.
However, moving data to the cloud poses some unique challenges, notably designing and maintaining the data schema; managing input and output failures; and ensuring data integrity.
No. 1. Designing and Maintaining the Data Schema
Building a pipeline from your internal resources to the cloud can be done — but it is typically time-intensive, full of seemingly endless details, and fraught with potential for errors.
Every pipeline starts with a simple script, like copying data from A to B. But that’s where the simplicity ends. For example, if a data engineer has to import data from server logs to a cloud-based destination such as Amazon Redshift, the challenge can become quite complex.
First, the engineer has to spend a couple of days familiarizing themselves with Redshift's documentation for loading data. Next, they have to monitor a directory for new files and convert each file into a format Redshift accepts. The monitoring piece is a pretty straightforward script, easily implemented in Python.
Still, the full solution can take weeks' or months' worth of coding to implement, even for someone experienced with Redshift and Python. Unfortunately, the grunt work doesn't end there. Real pipelines never stay on the 'happy path,' and when things don't go to plan, data leaks; worse, an engineer has to wake up in the middle of the night to fix it.
Converting schemas (aka 'parsing') is a tedious job most programmers hate. It requires meticulous attention to every detail of the data's formats: Do the commas have a space after them? Do the timestamps have milliseconds and a timezone? Do the numbers always contain a decimal point, or only sometimes?
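The timestamp question alone can require a small tolerant parser. This sketch (assuming ISO-8601-style inputs and Python 3.7+ for the `%z` offset handling; real feeds have many more variants) accepts values with or without milliseconds and with or without a timezone:

```python
from datetime import datetime, timezone


def parse_timestamp(raw):
    """Accept timestamps with or without milliseconds and timezone."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z",
                "%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S"):
        try:
            ts = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if ts.tzinfo is None:
            # Assumption: naive timestamps are treated as UTC.
            ts = ts.replace(tzinfo=timezone.utc)
        return ts
    raise ValueError("unrecognized timestamp: %r" % raw)


def parse_number(raw):
    """Numbers sometimes arrive with a decimal point, sometimes not."""
    return float(raw)
```

Every new input source tends to add another format string to that list, which is exactly why this work feels endless.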
No. 2. Managing Input and Output Failures
So far, so good. Suppose the engineer has handled all the schema issues correctly, the data analysts are happy with how the tables are structured, and some sort of schema version management is in place. Things should be on track.
Unfortunately, leaks will still need to be fixed since inputs and outputs have their own set of regular failures. The leaks can come from just about anywhere in the pipeline, beginning with the directory monitoring script, which is very unlikely to be error-free.
Other potential pitfalls to look for include the machine running out of disk space; errors created by the program writing the files; restarting the directory-monitoring script after an OS reboot; a DNS server failure; or the script being unable to resolve the IP address of the cloud-based database.
Every time a leak occurs, whether due to schema changes or input/output failures, sealing it is the first step. The second, and often forgotten step is recovering the data.
No. 3. Ensuring Data Integrity
The more leaks that occur, the harder it is to ensure data integrity. Leaks create the equivalent of scars in the data, sometimes leaving it in an unrecognizable state, or worse, rendering it unrecoverable. Depending on the duration and severity of the leak, data integrity can be compromised for hours, weeks, or even months.
Poor data integrity not only frustrates data analysts, but it also hurts the business's ability to make reliable data-driven decisions.
As a business grows, so does the number of users, and that means more data. If the original battle-scarred migration script is struggling to cope with 100,000 users and data leaks, think how much worse things will be with 1,000,000 users.
The organization will need more input sources and a larger cloud database. Consequently, the amount of potential schema changes and failures will increase exponentially, putting data engineering resources under constant stress.
To cope with the increased data complexity, most organizations will develop more software, seek help from a third-party company or embrace new technology such as distributed stream processing.
Best Practices for Synchronizing On-Premises and Cloud Data
There are several options available for migrating a data pipeline architecture to the cloud.
The first, doing it with in-house resources, requires highly skilled programmers who know how to connect data center technologies to cloud ones. Going this route is generally very time-intensive, error-prone and expensive.
To complicate matters, most programmers specialize in either the legacy world or the cloud, not both. Few of them know or care about the nuts and bolts of connecting the two worlds.
There are two alternative options: outsource the project to a third-party service provider or buy technology. Clearly, the service option is superior since it allows the organization to select a proven solution already implemented by others.
Buying technology is appealing because the organization will own software and hardware that can be customized to its specific needs and environment. The downside: people will have to spend hours learning the new bits and pieces and figuring out how they connect to legacy and cloud environments.
We have unpacked the three most pressing challenges of moving data to the cloud, as well as several approaches for doing so. Building a data pipeline in-house or selecting a third party to do the work are certainly options. Using a proven data pipeline service, meanwhile, can reduce the risks, time and cost.
About the Author: Yair Weinberger is Co-founder and CTO of Alooma, a company that is transforming data integration. He is an expert in data integration, real-time data platforms, big data and data warehousing. Previously, he led development for ConvertMedia (later acquired by Taboola). Yair began his career with the Israel Defense Forces (IDF) where he managed cyber security and real time support systems for military operations.