An Enterprise Guide to a Secure Data Science Pipeline
Open source is the backbone driving digital innovation (Gartner, 2019). It’s crucial to many of today’s leading-edge digital fields, including data science and machine learning. No single technology vendor can outmatch the pace of innovation the open-source data science community maintains. Thousands of open-source Python, R, and Conda packages provide data science practitioners with the building blocks they need to create models and applications using predictive analytics, natural language processing, robotics, and other cutting-edge tools.
These open-source tools are powerful, and they are essential for differentiation in a future where organizations must adopt AI to remain viable. But many enterprise data science teams are missing one thing: security protocols. In many organizations, there are simply no security protocols or governance tools for open-source software (OSS) use in data science. That gap exposes the organization to overlooked defects and vulnerabilities, not to mention potential licensing and intellectual property issues.
In some organizations, DevOps teams have already adopted security protocols for their use of OSS. DevOps teams use open-source building blocks to accelerate their workflows and build applications, but they generally do so within a framework of security and governance that protects their work and the enterprise infrastructure. Enterprise data scientists also use OSS tools and packages all the time, but they do so without this safety net, putting the organization and customer data at risk. In some cases, DevOps teams may catch vulnerabilities in data science models when they attempt to put them into production. By then, though, valuable data science effort has already been wasted on a model that will never see the light of day.
When data scientists don’t monitor for potential threats, vulnerabilities inevitably creep into models over time. Data science leaders must step up and collaborate with IT and security leaders to take charge of their open-source data science and ML pipelines. Together, these leaders can increase the flow of innovative models to production while safeguarding against technical and legal risk.
Just Like All Software, Open Source Carries Risk
Companies tend to choose OSS over proprietary software for four reasons: more choice, flexible support, transparency, and unmatched innovation.
The open-source community provides a veritable candy store of tools and libraries to work with, so there’s no need to get tied down to any single vendor. You can try new tools and choose only the best of the best (or the ones that best fit your needs), with minimal hoops to jump through.
With proprietary software, support is generally bundled in by the vendor and available either through the original license or for an extra fee. The software vendor offers what it offers: take it or leave it. With OSS, you have multiple support options, including community support, third-party vendors, and in-house staff hired to support your open-source components.
The source code of any OSS is viewable and fixable by anyone with the know-how to do so. Organizations using open-source software can verify its security themselves (or use an outside provider for verification). The source code in proprietary software, on the other hand, is usually only viewable and editable by a few internal people.
Data science and machine learning have a deep history with OSS, going back to the Apache Hadoop data-processing framework, which started a wave of open-source advances that’s still going strong. The top ML libraries, deep learning tools, and visual processing tools all came out of the open-source community. No single proprietary vendor can match its depth and breadth of innovation.