Octopai Brings ETL Reverse-Engineering Tool to Azure Data Factory
If you want to make good data-driven decisions, it’s critical for your data to be as accurate as possible. But ensuring the accuracy of data can be a difficult task in large enterprise environments with thousands of combinations of data sources, transformations, tables, and reports. Now Octopai is bringing its unique approach to deconstructing ETL flows and finding the source of data discrepancies to Azure Data Factory, the popular ETL service in Microsoft’s cloud.
Amnon Drori, the founder and CEO of Octopai, knows how troublesome enterprise BI environments can be. Before founding Octopai in 2015, he experienced firsthand the problems that can arise when colleagues are not working from a common set of numbers.
During a quarterly business review, Drori’s report showed that he had acquired 35 new clients. However, the CFO’s report only showed 22 new clients. The CEO was not pleased about the discrepancy of 13 clients, which, at an average deal size of $300,000, amounted to a lot of money.
“The CEO looked at us both and said, ‘Hey guys, somebody in this room owes me $3.5 million. What the heck is going on?’” Drori recalls.
It took three weeks to figure out what the problem was. It turned out an ETL process had been updated to include a new data source from the CRM system. Drori’s report reflected the ETL update, but the report his colleague Rafi was using did not, leading to the inaccurate number.
Data analysts and their line-of-business colleagues face this exact problem every day. Despite the effort put into ensuring data quality and consistency, data just seems to find a way of getting out of whack. The bigger the organization, the more opportunity there is for these data gremlins to show up and make people’s lives miserable.
Drori shared some figures to demonstrate the vast potential for data-driven pain: “If you think about an organization, if they have 1,000 ETL processes that are shipping data from data sources, and storing 10,000 tables in the data warehouse, and there’s a reporting tool that’s generating 1,000 reports, we’re talking about 10 billion data pipes between all of what I just described,” Drori says. Finding exactly which pipe is responsible for the problem is not easy, and is what Octopai specializes in.
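Drori’s arithmetic checks out: multiplying his figures for ETL processes, warehouse tables, and reports gives the combinatorial space a lineage tool has to search.

```python
# Back-of-the-envelope check of Drori's figures: every combination of
# ETL process, warehouse table, and report is a potential "data pipe"
# where a discrepancy can hide.
etl_processes = 1_000
warehouse_tables = 10_000
reports = 1_000

data_pipes = etl_processes * warehouse_tables * reports
print(f"{data_pipes:,}")  # → 10,000,000,000 (10 billion)
```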
This problem existed 15 years ago, when the volume and variety of data flowing through ETL tools into on-prem data warehouses was smaller. The scope of the problem has only expanded, as enterprises manage much bigger flows of data across cloud and hybrid BI environments in the hope of becoming even more data-driven.
“I think that the complexity to a certain extent is growing faster than organizations are capable of dealing with,” Drori says. “I would say that moving to the cloud and adopting a new set of BI systems may even create a bigger problem.”
When Octopai emerged from stealth about two years ago, its main offering was data lineage software that effectively reverse-engineers the ETL jobs that are so critical for preparing raw data for analysis in a data warehouse.
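The core idea behind data lineage can be sketched with a toy example. The graph, asset names, and traversal below are purely illustrative assumptions, not Octopai’s actual model: each entry maps a data asset to the upstream assets it is derived from, and walking the graph backwards reveals which raw sources feed a given report.

```python
# Illustrative sketch only: a toy lineage graph, not Octopai's model.
# Each key maps a data asset to the upstream assets it is derived from.
lineage = {
    "report_a.new_clients": ["dw.dim_clients"],
    "report_b.new_clients": ["dw.dim_clients_old"],
    "dw.dim_clients": ["etl_job_47"],
    "dw.dim_clients_old": ["etl_job_12"],
    "etl_job_47": ["erp.accounts", "crm.accounts"],  # updated to add CRM
    "etl_job_12": ["erp.accounts"],                  # missed the update
}

def trace_sources(asset, graph):
    """Walk the lineage graph back to the raw sources feeding an asset."""
    upstream = graph.get(asset, [])
    if not upstream:
        return {asset}  # a raw source: nothing further upstream
    sources = set()
    for parent in upstream:
        sources |= trace_sources(parent, graph)
    return sources

# Comparing the two reports' source sets exposes the discrepancy:
print(trace_sources("report_a.new_clients", lineage))  # ERP + CRM
print(trace_sources("report_b.new_clients", lineage))  # ERP only
```

In this toy setup, comparing the two source sets immediately shows why the reports disagree, which is the kind of root-cause question a lineage tool is built to answer.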
Since then, the company has built additional tools around that original data lineage offering to help its clients better understand their data. The first new offering was a data discovery tool that helps clients understand where all their data resides and how it changes. The company then developed a data catalog offering to help clients standardize how they talk about their data. These products now compose a suite of software that Octopai offers as a service from the AWS and Azure clouds.
The company’s offering supports cloud-based and on-prem BI environments, including ETL processes, data warehouses, and BI reporting environments. Over the past 18 months, the market has undergone a significant shift to the cloud, and Drori sees much of his growth coming from the cloud, even as companies maintain on-prem systems.
“In some cases, you might adopt Snowflake while not getting rid of your Oracle,” Drori says. “You might start using Talend or ADF as a new generation of ETL or business process development in the cloud, but you don’t get rid of the DataStage or Informatica you have. So now it creates a bigger problem, because migrating from Informatica to ADF may take three years. Moving from OBIEE, a 15-year-old reporting tool, to Power BI is about a two-to-three-year project.”
The company has a bigger presence on the Azure cloud than any other, so Azure Data Factory, or ADF, is the first cloud-native ETL tool that the company is supporting, with others planned for the future.
ADF is Microsoft’s cloud-based ETL or ELT service for creating and orchestrating data workflows in the Azure cloud. It features a visual tool for defining data transformations, and leverages an Apache Spark engine running in Azure HDInsight to execute the transformations.
ADF supports dozens of data sources and sinks, including Azure data stores like ADLS Gen1/Gen2 and Cosmos DB, as well as many third-party databases, file systems, object storage systems, SaaS applications, APIs, and protocols. It integrates with Azure DevOps, and supports the use of parameters, triggers, and complex orchestrations of data pipelines involving custom state passing and looping containers.
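To make the pipeline concept concrete, here is a hedged sketch of roughly the JSON shape of a minimal ADF pipeline with a single Copy activity, expressed as a Python dict. The pipeline, activity, and dataset names (“BlobInput”, “SqlOutput”) are hypothetical, and the exact schema should be checked against Microsoft’s ADF reference documentation.

```python
import json

# Hedged sketch of an ADF pipeline definition with one Copy activity.
# Names are hypothetical; field layout approximates the documented
# pipeline JSON, not a verified, deployable artifact.
pipeline = {
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                "inputs": [{"referenceName": "BlobInput", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SqlOutput", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ],
        # Parameters like this can be supplied at trigger time.
        "parameters": {"windowStart": {"type": "String"}},
    },
}

print(json.dumps(pipeline, indent=2))
```

A definition of this shape would typically be committed to source control and deployed through ADF’s authoring UI, ARM templates, or the REST API, which is where the Azure DevOps integration mentioned above comes in.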