Data Lakes Crest In Drive to Boost Quality
As more data moves to the cloud, the composition of data lakes is shifting to new sources such as NoSQL databases while cloud data repositories emerge amid hybrid deployments, according to a big data survey.
The year-end survey released this week by “big iron” vendor Syncsort also found that Hadoop and especially Apache Spark continue to make inroads. Earlier enthusiasm for the Spark cluster-computing framework has translated into a shift to production workloads. The survey found 70 percent of the organizations polled are either in test or production. Forty percent said they are in production with either Hadoop or Spark, while 30 percent are running proof-of-concept or pilot programs.
The survey released on Monday (Dec. 18) also underscores growing concerns about data quality and regulatory compliance as companies brace for new European Union data privacy rules to kick in next May. Syncsort reported that 40 percent of survey respondents, mostly in the financial and insurance sectors, said unreliable data is a continuing problem, contributing to the steady shift to data lakes as a way to improve data quality.
Meanwhile, compliance with rules such as the EU General Data Protection Regulation is forcing companies to expand the scope of data governance as they place “a higher priority on putting processes in place that allow them to understand what their data is and where it has been,” the survey noted.
As ephemeral streams of data increasingly make their way into more permanent data lakes, 71 percent of those polled by Syncsort identified ETL as the most compelling use case. That result was well ahead of predictive, real-time and other analytics use cases, perhaps illustrating the pressing need for better data preparation tools as data lakes fill up faster with more unstructured sources.
“We are seeing increased adoption of data lake initiatives where organizations are very focused on governance of the data in the data lakes, increasing benefits through advanced analytics and machine learning and deployment of hybrid environments including cloud,” Tendü Yoğurtçu, Syncsort’s CTO, noted in a statement releasing the fourth annual survey findings.
“But those benefits can only be unlocked if organizations have access to enterprise data, can create trusted data sets and establish effective data governance practices,” Yoğurtçu continued. “This propels them to a place where they can not only adapt to digital disruption, but take advantage of it so their businesses thrive.”
As more companies embrace real-time capabilities such as Spark, the survey’s authors assert that customers will shift away from legacy platforms in hopes of harnessing data while reaping savings from investments in new data tools.
Syncsort said it polled nearly 200 respondents, including data architects, IT managers, developers, business intelligence and data analysts as well as data scientists at companies running either Hadoop or Spark. Among the industries represented are financial services and insurance, healthcare, government, telecommunications and retail.