Exposing the Data Scientist Myth: Using Big Data Without Them
Many organizations assume that big data initiatives require the near-mythical data scientist. The notion is partly propagated by media attention to the data scientist shortage and by the profession's sprawling responsibilities, which span data preparation, analytics, and fluency in business problems. Harvard Business Review seemingly cemented the perception when it declared data scientist the ‘sexiest’ job of the 21st century.
However, a less acknowledged (yet perhaps more pervasive) reality has quietly emerged in the wake of that hype. The self-service movement has flourished across some of the most vital aspects of the data landscape, with big data as its focus. In numerous situations, data scientists are not required to implement big data staples such as data preparation, analytics and data governance.
The self-service movement has not replaced the need for data scientists, but instead created a new reality in which the business and end users are able to seize control of big data and begin leveraging it without them.
The self-service era has perhaps gained the most traction—and success in automating components of data science—in the realm of analytics. There are multiple self-service options organizations can use to gain analytic insight on big data sets via the cloud, which offers the additional cost advantages of decreased physical infrastructure and pricing based on actual usage. In most instances, organizations must simply migrate their data to the cloud to perform a host of analytic options (descriptive, prescriptive, predictive and diagnostic), which makes this method viable for even small and mid-sized businesses.
Competitive vendors offer assortments of business intelligence tools, plug-ins for popular applications such as CRM, and a variety of visualizations and dashboards for publishing. Some cloud-based analytics providers even perform analysis for their customers, who merely identify business objectives or specific queries and simply wait for the results courtesy of sophisticated graph analysis.
Certain graph technologies can also help end users issue sophisticated queries without writing code due to their visual representation of data objects. “Without writing any SQL you can just click on the screen, specify a query as a sub-graph, and then the system will write the query and figure it out,” Franz CEO Jans Aasman noted about graph-based queries. Significantly, many service providers in the analytics space either employ or are operated by data scientists, which implies that one of the ways the self-service movement is impacting these professionals is by shifting their employment from the enterprise to cloud vendors.
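The sub-graph style of querying Aasman describes can be approximated in a few lines of code. The sketch below is a generic illustration, not AllegroGraph's actual API: the sample triples and the match() helper are hypothetical, but they show how a pattern with variables (here prefixed with '?') can stand in for a hand-written SQL or SPARQL query.

```python
# Minimal sketch of sub-graph pattern matching over a triple store.
# The data and the match() helper are hypothetical illustrations,
# not a real graph database's API.

triples = [
    ("alice", "works_for", "acme"),
    ("bob", "works_for", "acme"),
    ("alice", "knows", "bob"),
    ("acme", "located_in", "boston"),
]

def match(pattern, triples):
    """Return variable bindings (terms starting with '?') that
    satisfy every triple pattern in the sub-graph."""
    results = [{}]
    for p in pattern:
        next_results = []
        for binding in results:
            for t in triples:
                b = dict(binding)
                ok = True
                for term, value in zip(p, t):
                    if term.startswith("?"):
                        # Bind the variable, or check an existing binding
                        if b.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    next_results.append(b)
        results = next_results
    return results

# "Who works for an organization located in Boston?" -- the kind of
# query a user might draw on screen as boxes and arrows:
query = [("?person", "works_for", "?org"),
         ("?org", "located_in", "boston")]
print(match(query, triples))
```

A visual tool would generate the pattern from shapes drawn on screen; the engine then binds the variables by scanning the stored triples, so the user never writes query syntax at all.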
One of the critical aspects of big data analysis that has traditionally been associated with data scientists is the modeling process, which exists at the nexus between analytics and data preparation.
Typically, big data modeling is extremely time consuming and can delay time to insight. Machine learning algorithms, however, can substantially hasten the process by generating candidate models from existing data and its usage patterns.
Organizations can leverage machine learning technologies from service providers specializing in Machine Learning-as-a-Service, or from cloud vendors who focus on an assortment of data discovery and predictive modeling capabilities. Competitive vendors provide models and analytics results with Natural Language Processing-based explanations, as well as with suggestions for action.
Self-service data modeling for big data sets enables end users to incorporate a variety of sources into their analytics options. Cognitive computing solutions specialize in this aspect of analytics and enable the incorporation of on-the-fly, time-sensitive data (weather, news, etc.) with conventional enterprise sources for expedient analytic insight—without end users hiring data scientists.
The self-service movement encompasses data preparation in two fundamental ways, both of which are based on semantics. The first involves data preparation tools and platforms designed to handle the wrangling process, which often includes cleansing, integration, and transformation for analytics or application use.
These solutions provide overviews of enterprise data and identify the relevant attributes that make integration between sources advisable for specific use cases; some catalog metadata for such purposes and combine it with intuitive visualization capabilities to provide such information at a glance. According to Tamr Global Head of Strategy, Operations and Marketing Nidhi Aggarwal, the effect is that “You can actually switch from having the IT and the coding people be the only people that can interact with data, to the business people.” Machine learning algorithms can expedite decisions about which sources or data types to integrate, and act on them, to suit particular use cases. According to Forrester, vendors are equipping more BI and analytics tools with self-service data integration capabilities for ETL.
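A toy version of this attribute matching can illustrate the idea. Everything below is a hypothetical simplification: real platforms such as Tamr rely on trained models, while this sketch scores column pairs only by name similarity and value overlap, then flags the pairs that look like integration candidates.

```python
# Simplistic sketch of how a preparation tool might flag attributes
# that make two sources candidates for integration. Sample data,
# column names, and the scoring heuristic are all hypothetical;
# production tools use trained matching models.
from difflib import SequenceMatcher

crm = {
    "cust_name": ["Acme Corp", "Globex"],
    "cust_email": ["a@acme.com", "g@globex.com"],
}
billing = {
    "customer_name": ["Acme Corp", "Initech"],
    "invoice_id": ["INV-1", "INV-2"],
}

def join_candidates(src_a, src_b, threshold=0.6):
    """Score every column pair by name similarity plus shared values."""
    suggestions = []
    for col_a, vals_a in src_a.items():
        for col_b, vals_b in src_b.items():
            name_score = SequenceMatcher(None, col_a, col_b).ratio()
            overlap = len(set(vals_a) & set(vals_b)) / max(len(set(vals_a)), 1)
            score = 0.5 * name_score + 0.5 * overlap
            if score >= threshold:
                suggestions.append((col_a, col_b, round(score, 2)))
    return sorted(suggestions, key=lambda s: -s[2])

print(join_candidates(crm, billing))
```

Surfacing suggestions like these in a visual catalog is what lets a business user, rather than a coder, decide which sources belong together for a given use case.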
The self-service movement is also enabling end users to bypass conventional preparation and integration concerns with semantically enhanced data lakes. The incorporation of graph-based models and ontologies provides visualizations of data and descriptions of their properties, respectively, enabling users to select which data to integrate for specific purposes. Both methods give end users more control over their data and the preparation process without waiting for data scientists. According to Cambridge Semantics president Alok Prasad:
“If you have to get in line to get your data prepared and ready before you can use it, you’ll only go to the data scientist for bigger problems and not the smaller ones. What you need is the ability to self-serve where users themselves can understand the data, discover the data, and use different elements of data to analyze, filter, and push it to other systems.”
The semantic approach of preparation tools and data lakes is also critical for reinforcing data governance protocols, without which the self-service movement would produce more harm than good.
Smart data technologies can help to facilitate role-based access to data, regardless of where they’re stored, and provide crucial information about data lineage and traceability. Organizations can catalog metadata with certain preparation solutions, enabling them to tag data according to use or user as mandated by governance policies.
The confluence of semantic and metadata consistency can create the foundation for technological conformity to governance principles in a self-service environment.
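One way to picture tag-based governance is a catalog in which every dataset carries metadata tags and every role is granted a set of tags; access is permitted only when a dataset's tags are a subset of the role's grants. The datasets, roles, and tags below are all hypothetical, chosen purely to illustrate the mechanism.

```python
# Minimal sketch of governance via metadata tagging: datasets carry
# tags, roles are granted tags, and an access check compares the two.
# All names and tags are hypothetical examples.

catalog = {
    "sales_2024": {"tags": {"finance", "pii"}},
    "web_clicks": {"tags": {"marketing"}},
    "payroll": {"tags": {"finance", "pii", "restricted"}},
}

role_grants = {
    "analyst": {"marketing", "finance"},
    "hr_admin": {"finance", "pii", "restricted"},
}

def accessible(role):
    """List datasets whose every tag is granted to the role."""
    granted = role_grants.get(role, set())
    return sorted(name for name, meta in catalog.items()
                  if meta["tags"] <= granted)

print(accessible("analyst"))
print(accessible("hr_admin"))
```

Because the check reads only metadata, the same policy applies wherever the data physically lives, which is what makes role-based access workable in a self-service environment.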
Data Science Automation
Consequently, it is not only possible to utilize big data without these professionals, but to do so inexpensively and in accordance with established data governance procedures.
However, the newfound control over data that this movement gives end users does not marginalize data scientists, who still add value by discerning solutions and tailoring applications to address business problems with their unique combination of skills. Nonetheless, organizations can—and do—deploy big data in the midst of the data scientist shortage.
About the author: Jelani Harper has written extensively about data management for the past several years. He specializes in semantic technology, big data, and their many different applications.