September 16, 2022

Security Concerns Causing Pullback in Open Source Data Science, Anaconda Warns


More than 40% of organizations surveyed by Anaconda say they’re pulling back on their use of open source data science tools due to security concerns, with potential vulnerabilities such as Log4j the number one driver, the data science tool maker said in its latest State of Data Science report.

Nearly 90% of the 3,493 respondents to Anaconda’s survey indicate they use open source software in their organizations. Anaconda’s distribution of Python and R tools is one heavily used open source project for data science (used by 47% of respondents), as are GitHub (45%), RStudio (33%), Databricks (16%), and H2O (10%).

Only 8% of the survey respondents said they’re not allowed to use open source at their organization. The number one reason this cohort has not adopted open source is concerns about vulnerability, potential exposures, and risks, with 54% expressing these fears, Anaconda’s report says. That is a 13% increase from the 2021 report, the company says.

The vulnerability in Log4j discovered about 10 months ago is casting a long shadow on the entire open source software community, as concerns about the so-called “software supply chain” ricochet among open source users.

(Source: Anaconda’s 2022 State of Data Science report)

About 25% of the survey respondents said they scaled back their use of open source after the Log4j vulnerability was disclosed, with another 15% saying they scaled back before then. One third of respondents said they have not scaled back open source software usage, while only 7% say they have increased it.

Anaconda also looked at how organizations are securing their open source data science and machine learning software. The company found that 43% of survey respondents reported using a managed repository, while 36% say they use a vulnerability scanner (a figure that was up about 6% year over year). Another 34% reported that they do manual checks against a vulnerability database, the report says, while 19% are not securing their open source pipelines (luckily, that figure was down almost 6% year over year). Nearly a quarter (23%) say they’re not sure.

But it wasn’t all doom and gloom in the field of data science. Anaconda found progress being made in one subfield in particular: explainability and bias mitigation.

On the model explainability and interpretability front, Anaconda found 36% of survey respondents indicate they are using tests to assess interpretability, while another 30% have implemented ways to prevent the cherry-picking of data. A bit more than one quarter (28%) say they only use low-interpretability models in low-risk scenarios, while another 28% say they use statistical tests to assess variable infidelity. Only 24% said they’re not using any measures or tools to ensure model explainability and interpretability.

Progress was also spotted in terms of model fairness and bias mitigation. Anaconda found that nearly one-third (31%) of survey respondents say they evaluate data collection methods according to internally set standards, while 25% say they manually test data sets for fairness and bias. Nearly one in five (19%) say they perform a suite of statistical fairness tests, while 15% have a center of excellence. About one quarter (24%) say they have no standards for fairness and bias mitigation.

Anaconda also looked at what data science skills respondent companies are looking for, and inquired about a potential talent shortage looming on the horizon for data science organizations.

(Source: Anaconda’s 2022 State of Data Science report)

Engineering skills stood out as the most-needed skill in the data science organization, with 38% of survey respondents choosing this category as the number one concern. That was followed by probability and statistics (33%), business knowledge (32%), and big data management (31%), the survey says.

Overall, about 90% of professional respondents say their organizations “are concerned about the potential impact of a talent shortage,” Anaconda says, with nearly two-thirds (64%) saying they were most concerned about their organization’s ability to recruit and retain technical talent. More than half said insufficient headcount could hurt the organizations’ adoption of data science.

Despite the negative outlook on the skills front, Jessica Reeves, senior vice president of operations at Anaconda, isn’t too concerned.

“With data scientists continually cited as one of the best careers in the U.S., the pool of talent is sure to catch up to the demand,” Reeves said in a press release. “Solutions proving successful to help close this gap include upskilling existing workforces and permitting stronger remote work options. Organizations should bolster the tools and resources available for continued learning, and academic institutions should fill in the skills gaps for students and turn them into strengths as they prepare to enter the workforce.”

You can access a copy of the report here.