WANdisco Drops In On the Hadoop Cloud Migration Party
Organizations that want to move their on-prem Hadoop cluster to the cloud may be interested in a new solution unveiled yesterday by WANdisco. Called LiveData Migrator, the software allows customers to move their Hadoop data to any public cloud without taking the cluster offline, while guaranteeing that the data is up to date and accurate in both locations until the migration is complete.
Moving small amounts of data, such as a few terabytes, to the cloud is not that difficult. You can pipe the data over the Internet or ship it on a disk drive. But when you have large amounts of transactional data – that’s where it gets interesting, says David Richards, the CEO of WANdisco.
“When you have millions of transactions per second, the data is changing all the time, and petabyte-scale data – how do I move that data from on-premises to the cloud?” Richards says. “The answer is, quite frankly, you don’t, because it’s a massive service gig, or you have to undergo an elongated outage that’s just not realistic. So companies just don’t do it.”
That’s the situation that confronted GoDaddy, the website domain registration company, which operated an extremely active Apache Hadoop cluster that processed millions of transactions per day. Turning off its 800-node Hadoop cluster while it migrated data to AWS would cause a prolonged outage for the company.
But with LiveData Migrator, GoDaddy was able to migrate 70 TB of HDFS data from its on-prem cluster to AWS S3 in just five days, according to Wayne Peacock, chief data and analytics officer at GoDaddy.
“We found WANdisco’s LiveData Migrator to be the optimal approach to deliver the best time to value, rather than running a more time-consuming and costly manual migration project internally,” Peacock states in a press release.
WANdisco has been an active player in the Hadoop scene for years, and developed some of the most sophisticated data replication technology to provide high availability for Hadoop clusters. That technology formed the basis for LiveData Migrator, says Richards.
“We’ve got patented technology that surrounds our distributed coordination engine that understands the sequence of transactions at massive scale and is able to maintain that order of transactions in light of all sorts of [disruptions] to the wide area network,” Richards said. “So in essence, what we’ve done is leveraged that technology to the specific use case. It’s taken us two years to build it from a mathematical design. But we’ve done it.”
Richards says that math allows LiveData Migrator to do something unique in the industry: to scan HDFS only once, and then maintain an up-to-date copy of the files, no matter the volume or velocity of the changes. Depending on transactional volume, the software can move 1PB of data to the cloud in 30 to 60 days, Richards says.
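For a sense of what those figures imply, here is a back-of-the-envelope calculation of the average sustained throughput behind them. It is only a sketch: the function name `required_gbps` is made up for illustration, decimal units are assumed (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), and it ignores the extra traffic generated by data that changes while the migration is running.

```python
SECONDS_PER_DAY = 86_400

# Assumption: decimal units, not binary (TiB/PiB).
TB = 10**12
PB = 10**15


def required_gbps(data_bytes: float, days: float) -> float:
    """Average throughput in gigabits per second needed to move
    data_bytes within the given number of days."""
    if days <= 0:
        raise ValueError("days must be positive")
    bits = data_bytes * 8
    return bits / (days * SECONDS_PER_DAY) / 1e9


# Figures quoted in the article.
scenarios = [
    ("1 PB in 30 days", 1 * PB, 30),
    ("1 PB in 60 days", 1 * PB, 60),
    ("70 TB in 5 days (GoDaddy)", 70 * TB, 5),
]

for label, size, days in scenarios:
    print(f"{label}: {required_gbps(size, days):.2f} Gbps")
```

Under those assumptions, the quoted 1PB range works out to an average of roughly 1.5 to 3.1 Gbps, and the GoDaddy migration to about 1.3 Gbps.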
“We realized we needed to come up with a screaming fast Formula 1 engine to move data on premise to cloud,” he says. “We guarantee one scan of the data. There’s not multiple scans of the data. Everybody else is going to have to recursively scan data until they get it down to an invisible point where they say, OK unplug it. We just do one scan. So we’re exponentially faster than anybody else just for that one reason. That requires some pretty complex math.”
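The repeated-scan approach Richards attributes to competitors can be illustrated with a toy model. This is not WANdisco's algorithm or anyone else's actual product logic. It simply assumes that each pass copies whatever is outstanding, that data keeps changing at a constant rate during the pass, and that the operator cuts over once the outstanding delta falls below a chosen threshold. The function name, parameters and sample numbers are all hypothetical.

```python
def rescan_deltas(total_tb: float,
                  change_tb_per_day: float,
                  transfer_tb_per_day: float,
                  cutover_tb: float) -> list[float]:
    """Return the amount of data (in TB) copied on each pass of a
    scan-copy-rescan migration, stopping once the outstanding delta
    is at or below cutover_tb.

    Toy model: the data changed during one pass becomes the work
    for the next pass.
    """
    if cutover_tb <= 0:
        raise ValueError("cutover_tb must be positive")
    if change_tb_per_day >= transfer_tb_per_day:
        # The delta never shrinks, so the migration never converges.
        raise ValueError("change rate must be below transfer rate")

    deltas = []
    remaining = total_tb
    while remaining > cutover_tb:
        deltas.append(remaining)
        pass_days = remaining / transfer_tb_per_day
        # Everything that changed during this pass must be re-copied.
        remaining = change_tb_per_day * pass_days
    return deltas


# Hypothetical numbers: 1,000 TB cluster, 2 TB/day of changes,
# 20 TB/day of transfer capacity, cut over below 0.5 TB outstanding.
passes = rescan_deltas(1000, 2, 20, 0.5)
print(len(passes), "passes:", passes)
```

In this model every pass also requires walking the file system again to find what changed, which is the overhead a single-scan design that tracks changes as they happen would avoid.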
While the math behind LiveData Migrator is complex, using the product is relatively simple. According to Richards, customers simply install the software as a client application on the Hadoop cluster. The software, which most customers install on a repurposed data node, monitors the cluster’s name node.
The software does not require a beefy machine. Typically, a machine with 50% to 70% of the RAM of the name node will do the trick. “We just key off of the name node,” Richards says. “The number of transactions we’re going to hold is bound by the size of the name node anyway… You can overspec the server if you want. It won’t make much difference. The cleverness is in the technology, how to do a single scan very fast. We can multiplex the number of connections so I can saturate the bandwidth if I need to. But the scan is not CPU intensive.”
Customers can migrate the data over the network, or they can write the HDFS data to a dedicated storage device that is then mailed or shipped to the cloud provider. Some customers do not want to use the network for data migration, Richards says, and LiveData Migrator gives them that option.
Clients don’t need any special privileges to run this, either on the on-prem Hadoop cluster or on the big Hadoop cluster in the sky that they’re moving to (the three major cloud vendors all offer their own versions of Hadoop). The software supports AWS, Microsoft Azure, Google Cloud, IBM Cloud, and Alibaba Cloud, according to WANdisco’s website. Richards says the company will support destinations like Databricks, Snowflake, and “a whole raft of them in the future.”
“In the case of AWS, this is pretty much a turnkey solution,” Richards says. “You can be up and running and doing a migration now in about 15 seconds. Our old product required a deep understanding of Hadoop. We were in the write path and so on. You don’t even have to be in the write path of Hadoop anymore. We run like a client application on Hadoop. I don’t need any administration access. I don’t need any special skills in Hadoop. I just plug it in, turn it on, connect it to the cloud, put my cloud credential in, and away you go.”
WANdisco charges for the product based on the transactional volume of the host cluster. It provides the first 5TB free. Moving 1PB would cost around $150,000, Richards says.
Once customers have successfully migrated their on-prem Hadoop cluster to the cloud, Richards hopes they maintain an active license to LiveData Migrator.
“Nobody is going to choose a single cloud vendor,” he says. “As much as every single cloud vendor thinks that’s what’s going to happen. That’s just not going to happen. We’re seeing increasing demand for active-active, multi-cloud. In other words, I need to arbitrarily run applications in either Azure or AWS or Google Cloud. I’m going to choose on any given day, minute, second, where I’m going to run my applications. That’s also something we’re providing.”