Why Big Data and Data Scientists Are Overrated
What does it take to get value out of data? Many organizations assume that you need a big collection of data and a highly skilled data scientist to spin all those 1s and 0s into dollar signs. In reality, companies need neither of those things to be successful with data.
One of the biggest mistakes that organizations can make with their data analytics projects is to assume they need a data scientist at the very beginning. According to Daniel Mintz, chief data evangelist with Looker, organizations are much better off starting lower on the data analytics food chain and working their way up as they gain proficiency.
“I’ve seen cases where people hired a data scientist way before they’re ready,” Mintz tells Datanami. “They don’t actually have any data, and even if they do, it’s dirty and dispersed across a whole bunch of places. The data scientist who doesn’t necessarily understand their business arrives and says ‘Where’s the nicely curated data set that you want me to use to solve problems?’ And they say, ‘Oh, we didn’t know that was a prerequisite.'”
The fact is, data scientists spend about three-quarters of their time doing data janitorial work – collecting, transforming, and cleaning data – rather than building the complex predictive models that they were actually hired for. That equals frustration for data scientists who had high hopes of making an impact, and sour grapes for the people who hired them.
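That janitorial work is mostly mundane normalization: sources rarely agree on formats, so records have to be cleaned before any model sees them. A minimal sketch of the idea, using hypothetical customer records and field names:

```python
from datetime import datetime

# Hypothetical raw records pulled from two different sources: same
# customer, inconsistent casing, whitespace, and date formats.
raw = [
    {"email": " Alice@Example.com ", "signup": "2021-06-01"},
    {"email": "alice@example.com",   "signup": "06/01/2021"},
    {"email": "BOB@example.com",     "signup": "2021-06-02"},
]

def clean(record):
    """Normalize casing/whitespace and coerce dates to ISO format."""
    email = record["email"].strip().lower()
    signup = record["signup"]
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            signup = datetime.strptime(record["signup"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    return (email, signup)

# Once the fields are normalized, the duplicate rows collapse.
cleaned = set(map(clean, raw))
print(cleaned)  # two distinct customers remain
```

Unglamorous as it looks, this kind of reconciliation is where the bulk of the hours go before any predictive modeling can start.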
Organizations should start with the basics, and work up from there. Instead of being lured by the “shiny object” syndrome and thinking you need a big Hadoop data lake or neural networks to solve a problem, seek the simplest answer.
“People make a mistake if they jump right to the most sophisticated tool, because they’re wasting a lot of time,” Mintz says. “The reality is a lot of problems are quite tractable with a simple regression. And some problems don’t even need that. You can just look at the data and see what’s happening.”
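To make Mintz’s point concrete: a business question like “does ad spend actually move revenue?” often needs nothing more than a least-squares fit. A minimal sketch, using hypothetical monthly figures:

```python
# Hypothetical monthly figures, in $K -- no Hadoop or neural
# networks required to see the relationship.
ad_spend = [10, 15, 20, 25, 30, 35]
revenue  = [48, 55, 61, 70, 74, 83]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(revenue) / n

# Simple linear regression: slope = covariance(x, y) / variance(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, revenue))
         / sum((x - mean_x) ** 2 for x in ad_spend))
intercept = mean_y - slope * mean_x

print(f"each extra $1K of ad spend ~ ${slope:.2f}K of revenue")
```

And for the problems that “don’t even need that,” simply plotting or eyeballing the two columns would tell the same story.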
Mintz’s personnel advice? Hire data generalists who can do the time-consuming data legwork that needs to be done before more highly skilled (and highly paid) data scientists come in to do their highly specialized thing.
“The really key skill is having somebody who can take what is fundamentally a business question and translate that into a data question,” says Mintz, who previously worked at MoveOn.org and other data-intensive operations. “That’s the key skill. When you’re not big enough to have specialists, the business people, who aren’t data people, will know what the right business questions are.”
Mintz recommends pairing a SQL-loving analyst with an ETL-loving engineer to start helping the business prepare itself to answer questions with data. As they document their data stores, define organization-specific metrics, and create workflows that transform and combine data in reliable and useful ways, they will start to see how the superpowers of the real data scientists could best be used.
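One concrete payoff of that pairing is pinning down an organization-specific metric in one place rather than in ad-hoc queries. A hypothetical sketch, using an in-memory SQLite database and an assumed definition of “active customer” (ordered in the last 30 days):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, date('now', '-5 days')),
        (2, date('now', '-45 days')),
        (3, date('now', '-1 day'));

    -- The metric definition lives in one view, so every later
    -- query agrees on what "active customer" means.
    CREATE VIEW active_customers AS
        SELECT DISTINCT customer_id
        FROM orders
        WHERE order_date >= date('now', '-30 days');
""")

n_active = conn.execute(
    "SELECT COUNT(*) FROM active_customers").fetchone()[0]
print(n_active)  # customers 1 and 3 qualify
```

Once definitions like this are written down and shared, a later data scientist inherits a curated schema instead of a guessing game.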
“As they scale up,” Mintz says, “they realize ‘Now we have three or five analysts, now we’re ready to add a data scientist, because now we’ve got a handle on what our data means, we know where things live in our schema, we know what problems might be tractable using more sophisticated algorithms.'”
There are real benefits to be had by analyzing data, but like everything else in life, you must walk before you run. Hiring the right people to make up your data analytics team, and hiring them in the right order, is important.
“Folks are looking for a magical unicorn who can do it all, when in reality it’s a team sport,” Mintz says. “You really need to be thinking about how does the team work together, and how as a team do you cover all the bases. That starts with somebody who’s a utility player who can play all the positions, then you start to specialize.”
Follow the Data Crumb
Just as you don’t start with a data scientist, you shouldn’t start with big data, either. In fact, it’s much better to start with the right piece of data, however small that is.
For Wolf Ruzicka, the chairman of Washington D.C.-based analysis firm EastBanc Technologies, it starts with a single crumb of data.
“Just the other day I ran into a company that has accumulated 50PB of data,” Ruzicka tells Datanami. “That’s great. But when you compare them against competitors…all the metrics — profit margin, revenue, growth, size — really are the same. They were very proud of that 50PB data lake. But really when I look at it, it must have turned into a data swamp.”
When EastBanc engages a new client, there is a flurry of activity and brainstorming meetings as EastBanc analysts do their best to understand the business problem at issue, and the potential data available to solve it.
The company starts small and works quickly. The customer may have more pressing questions they want answered, but starting with the low-hanging fruit on easily explored data is a good way to get going, and it provides early validation that the analytics effort is worthwhile. Setting a hard initial deadline of two to four weeks helps encourage fast iteration.
That first piece of useful data becomes a “data crumb” that typically leads to further success, Ruzicka says. “That’s what we call it,” he says. “One data crumb of relevant data, and we iterate from there.”
When you draw it out on a whiteboard, it looks very different than a typical big data architectural drawing. “It’s more of a data tree that you’re starting to groom,” he says. “You may end up with big data. But you don’t start with big data. You essentially turn it upside down.”
This approach is anathema to the current wave of big data thinking, which says one should throw all of one’s data into Hadoop, and hope that magical algorithms can make sense of it down the line. This approach may work, but most likely through sheer luck, Ruzicka says.
Ruzicka’s advice: It’s better to start with a smaller data set that’s more reliable and useful, than to start with a bigger data set of unknown value.
“Instead of being this pathological data hoarder, rather be someone who assembles the data and continuously goes through data spring cleaning at very regular intervals,” he says. “Just as bad as it may be not to have any data, it’s just as bad, confusing and expensive to have lots and lots of data and not make any use of it.
“So why not find that middle ground, where you iterate around data breadcrumbs that have correlations with each other, that bring value to each other, and then you purposefully build up that big database that you may ultimately end up with,” he continues. “Just find something of value and iterate from there, and over time you will answer the unknown unknowns that you were not even aware of in the beginning.”