Data Science Platforms As a New Force Multiplier
Models matter. Companies that build their businesses around meaningful models generate competitive advantage through a better understanding of their customers' needs, of their own business model, and of their ability to influence the market.
With artificial intelligence on the verge of a breakthrough, companies are investing heavily in people and technology, yet the majority still struggle to generate value from their data science practices.
Building these models creates new challenges for the modern enterprise. Model management is different from seemingly similar practices like software development. Models are developed through a research process and behave probabilistically, whereas software behaves deterministically. These differences mandate different materials, processes and behaviors.
A platform that consolidates these materials and processes in a single location, and helps to define and govern the behaviors, is critical for enabling data science teams. The platform should enable teams to collaborate and fulfill their purpose: building models that change their business.
These platforms should also enable IT leaders to safely deliver the freedom data science teams require to experiment and explore with data, while maintaining the regulatory and compliance standards of IT policy.
In this article, I will break down the barriers that businesses often face when attempting to deploy models in their organizations, explain why data science teams need a data science platform to improve their productivity and generate outcomes, and introduce the concept of a data science system of record to address these challenges.
Three Ways Models Are Different
Models are different. The biggest barrier to becoming a model-driven business – the cause of the friction and challenge that companies encounter – is what I like to call the Model Myth.
It’s the misconception that because models involve code and data, an organization can manage them like existing assets (software, databases or BI dashboards), using the same people, processes and platforms from the past.
Models are a new, fundamentally different type of digital life. They’re different in three ways:
Models rely on computationally intensive algorithms that evolve on a near-daily basis. These technologies and techniques come from a vibrant open source ecosystem, which makes it hard for traditional IT processes to manage them adequately and ensure they’re available for data science teams to exploit.
Models also place heavy demands on infrastructure: they require scalable compute and, increasingly, specialized hardware such as GPUs to perform these functions in time-critical environments.
The process of building models is different. Data science is fundamentally research – it’s experimental, iterative and exploratory. You might try hundreds of ideas before finding something that works, and often you’re picking up from the work of another data scientist.
Data scientists therefore require the freedom to experiment inside the organization: to explore data and to work with other data scientists to test their assumptions and hypotheses. They need to be able to publish results in a meaningful way that can drive a step change within the business, be it new pricing, a new product, or a new way to engage customers.
Models are probabilistic; they do not deal in “correct” answers – they determine what is most likely to happen based on their input parameters. Those parameters constantly change, so unlike software, models must change as the world around them changes.
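To make the distinction concrete, here is a minimal, hypothetical sketch of a probabilistic scoring model in Python. The weights and feature names are invented for illustration; a real model would learn them from data. The point is that its output is a likelihood, not a verdict, and it shifts as the inputs shift.

```python
import math

# Hypothetical fraud-scoring model: these weights are illustrative
# placeholders, not a trained model.
WEIGHTS = {"transaction_amount": 0.004, "overseas": 1.2, "bias": -3.0}

def fraud_probability(transaction_amount: float, overseas: bool) -> float:
    """Return the model's estimated probability of fraud, not a yes/no answer."""
    score = (WEIGHTS["bias"]
             + WEIGHTS["transaction_amount"] * transaction_amount
             + WEIGHTS["overseas"] * (1 if overseas else 0))
    return 1 / (1 + math.exp(-score))  # logistic link: maps score to (0, 1)

# The same code yields different likelihoods as the inputs (the world) change.
low_risk = fraud_probability(50, overseas=False)    # small domestic transaction
high_risk = fraud_probability(900, overseas=True)   # large overseas transaction
```

Contrast this with conventional software, where the same input always produces the same prescribed answer: here the "answer" is a probability whose meaning depends on thresholds, retraining, and a world that keeps moving.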
Think about business leaders and IT professionals who have spent the last two decades immersed in “big data” or enabling traditional software development processes. Models represent something completely different from the world they’ve previously lived in.
Also, think of data scientists. Tasked with building models that meaningfully change the way a business operates, they are often governed by processes that erode their ability to do what they do best: research, experiment and explore.
Barriers to Success in Data Science
The potential for algorithms to change the way businesses operate is well understood; as such, many organizations have rushed to invest in building data science teams and technologies to capitalize on it.
Because models are different, organizations often struggle to set up the right environment to succeed.
Some of the key challenges organizations face include:
- Managing the infrastructure demands that come with data science, deploying and scaling in an agile manner;
- Providing clear knowledge management and collaboration solutions that enable cross-functional teams to work together seamlessly;
- Producing clear change management principles, version control, and the ability to conduct code reviews in a timely manner;
- Deploying the results of a model into a production environment that can be accessed by the stakeholders and/or applications that require its predictions;
- Providing clear visibility into the status of a project.
Orchestrating the workflow of a data science project can therefore seem like managing five-year-olds playing a game of soccer. Everyone is keen, everyone is chasing the ball but there is little structure and no strategy for how the game is played. Crucially, there is limited understanding that other players are required to assist in scoring or defending goals.
If we compare the challenges faced by data science teams with how other departments within the business operate, we see a clear gap in the foundational technology required to manage their outputs.
Whether it’s Sales with Salesforce, HR with Workday, or Engineering with GitHub, mature disciplines require systems of record for managing their information, processes, workflows, and outputs. Without these, it’s hard to organize around shared principles or scale best practices in a large enterprise.
No wonder data science teams can’t scale. Simply put, Data Science needs its own system of record.
A Data Science System of Record
To help businesses unlock the intrinsic value of models, data science platforms should provide a framework in which IT leaders, data science leaders and data scientists can work together to create the environment required for success.
Data scientists need flexibility: the ability to self-serve their own infrastructure and scale it up and down as required, given the ephemeral nature of many data science projects. They want the freedom to use the tools they are comfortable with, and the ability to add new tools and techniques to their kitbag as these become popular or help solve a business problem. The transient, dynamic nature of many data science infrastructure needs can stress traditional IT practices.
Data science leaders want visibility into how data science projects are performing, along with management metrics on models that have been deployed into production. They want to identify blocked projects early, and to help their data scientists deliver reproducible results at speed by finding and reusing existing assets and work. These leaders care about enabling their teams to collaborate on complex projects spanning disparate geographies and tools. They also care about reproducibility and auditability: the ability to precisely reproduce findings years later, as required in regulated industries.
IT leaders want to provide the flexibility and freedom that data science teams require, but need to do so safely, without compromising enterprise security or governance standards. They want to reduce the stress that constantly adding and updating packages brings to most IT architectures. They also need cost controls and administrative tools similar to those they’ve used when deploying cloud-based infrastructure in the past.
These systems of record should support compounding knowledge growth: as new projects are worked on and resolved, the institutional knowledge and assets they generate should be easily searchable by subsequent project teams.
They should centralize access to different tools – not just model development tools, but tools spanning the entire life cycle of a data science project – acting as a central hub and repository for all code, documentation and workflow for each data science workload in the organization. Much like other systems of record, they should also automatically document the logs, metrics and results of projects for review.
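As a sketch of what automatically documenting logs, metrics and results might look like, the Python snippet below captures a single experiment run as a structured, reviewable record. The field names and `log_run` helper are hypothetical, invented for illustration rather than taken from any particular platform's schema.

```python
import datetime
import hashlib
import json

def log_run(project: str, params: dict, metrics: dict, code: str) -> dict:
    """Capture one experiment run as a reviewable record.
    All field names here are hypothetical, for illustration only."""
    return {
        "project": project,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "params": params,    # experiment inputs, kept for reproducibility
        "metrics": metrics,  # results to be reviewed later
        # A short content hash stands in for a real version-control reference.
        "code_version": hashlib.sha1(code.encode()).hexdigest()[:8],
    }

run = log_run("churn-model", {"learning_rate": 0.1}, {"auc": 0.87}, "print('train')")
print(json.dumps(run, indent=2))
```

In practice a system of record would persist records like this centrally and index them for search, so a later team picking up the project can find what was tried, with which parameters, and how it performed.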
Models have become foundational for successful, data-driven enterprises. In order to fully realize the value of their data scientists, mature organizations are realizing they need to advance from having teams using disparate tools and practices to having a centralized data science system of record.
By bringing together these disparate tools and processes under the umbrella of an orchestration platform, data science leaders can encourage collaboration, knowledge sharing and agility among their teams, as well as give their teams the flexibility to try new tools and standards in an ever-evolving landscape.
A data science platform can also allow an organization to future-proof its data science investment and avoid the pitfalls of vendor lock-in, provided it chooses a platform that isn’t wedded to any particular infrastructure substrate or compute framework.
As Naveen Singla, data science center of excellence lead for crop science at Bayer, said, “We needed a platform that could abstract away complexities and allow all users to do analysis at scale, utilizing the modern tech stack, and getting better insights from data… This ultimately results in more models being delivered and deployed in a shorter window of time, which is empowering Bayer to be a model-driven company that’s at the forefront of farming.”
About the author: David Bloch is a data science professional with 20 years’ experience in data and analytics roles. He recently joined Domino Data Lab as a data science evangelist, tasked with boosting awareness of the platform and product and assisting customers in driving adoption of models and machine learning. David has a particular focus on helping businesses build out their community of expertise in data science, and on coaching data science leaders on how to build high-performing teams. He previously held executive leadership roles at companies such as Fonterra, Vodafone and Unleashed Software.