Reverse Engineering ETL Jobs for Fun and Profit
As data-loving businesses amass more and more bits and bytes, the credibility of derived results — as showcased in BI dashboards, KPIs, reports, and sundry other deliverables flowing downstream of the ETL spigot — trends further out into dangerous territory.
Nothing reminds the data analyst of this tenuous balance like the CFO asking him where he got his numbers. “These numbers look off,” the CFO barks, glaring up from the preliminary 10-Q. “Where the hell did they come from?”
Because he knows “the data warehouse” is not an acceptable answer, the analyst understands that a painful reconstruction of all the relevant ETL jobs is in his future.
Amnon Drori has seen this situation play out time and time again. “A lot of companies want to be data driven,” the tech exec says. “Unfortunately, more and more systems are being involved in transforming or moving the data from applications to the business user, and it’s very difficult to track the data movement process around all of these systems.”
Even if one assumes that data is perfectly clean and accurate when it flowed from the CRM system, the accounting app, or the ERP system (which is not always a good assumption to make), there’s no guarantee that something didn’t muck it up somewhere along the line. ETL jobs, with their complex extractions, elaborate mappings, and numerous transformations, are often fingered as the culprit.
Drori decided the problem of traceability in multi-source ETL processes was great enough that he co-founded a company called Octopai to address it. The Israeli company’s eponymous product gathers all the metadata from different products involved in the data analytics chain, stores it in a cloud repository, and then uses algorithmic models to analyzes the metadata to determine where a given piece of data originated from and how it moved through the analytic process.
Octopai has sold its software to a range of big and small companies in a variety of industries, from financial services to manufacturing to retail. “They all suffer from one common denominator, which is where the hell is the data and how does it move between all the different systems we’re using,” Drori says.
The company’s cloud-based product gives users two key capabilities built on that metadata repository. The first is a tracing functionality that allows users to check the lineage of the data that goes into a report, dashboard, etc. The second is an impact analysis functionality that lets users see how changing one aspect of an ETL job changes things downstream.
The tracing function is valuable whenever questions arise as to the provenance of a derived result that requires data sourced from multiple systems, such as conversion rates. “The software automatically reverse engineers how the data landed in that report between system and within systems, what we call horizontal image and vertical images,” Drori says.
The impact analysis function, meanwhile, can help analysts comply with regulations like GDPR that require companies to know with great detail how their systems are processing data. “Within five seconds, I know that whatever I’m going to do to this ETL, either adding a table, adding a field, changing the map, joining a field, or masking a field…I know what would be the impact of that change,” he says.
Octopai goes beyond the data-tracing functionality that exists within individual ETL software products by providing a general-purpose solution that works across a range of data sources, warehouses, ETL tools, and BI products, Drori says.
“Our product can be used by anyone in the BI group,” Drori says, “not one or two people who have been certified after spending 5,200 hours using each one of these different vendors’ tools.”
Octopai doesn’t support every data source and sink. On the database front, it supports familiar faces like SQL Server, Oracle, Teradata, and Vertica. It supports ETL tools such as those from IBM, Informatica, or Talend. It supports PowerBI, Tableau, Qlik and SAP‘s BusinessObjects, among other BI tools. It doesn’t yet support Hadoop, but it’s on the roadmap to be supported later this year, Drori says. Octopai can, however, support IBM i and System z mainframe applications, which continue to run core business processes for tens of thousands of companies around the globe, according to Drori.
Subscriptions for Octoapai start at around $18,000 per year.