Getting Data Scientists and Data Engineers on the Same Page
Like cats and dogs, data engineers and data scientists often seem like two incompatible species. Scientists love probabilities and experimentation, while engineers live for repeatability and efficiency. They have different responsibilities and dissimilar mindsets, but getting these two personas to work together is a critical step for any organization that wants to succeed with data.
Data scientists emerged as the rock stars of the 2010s thanks to their ability to use machine learning algorithms to detect small differences in big data sets and exploit them for business advantage. As data continued to grow bigger and more complex, the work became more specialized and data engineers emerged as a critical cog in the big data machine.
In today’s larger organizations, you will often find a mix of data scientists and data engineers working with data (and perhaps other related positions, such as the machine learning engineer, which blends characteristics of both). While engineers and scientists ostensibly share the same end goal (their organization’s successful exploitation of data), their paths to that goal could not be more different.
One data scientist who has given the problem some thought is Max Boyd. Before joining Kaskada as its data science lead, Boyd worked at numerous Seattle, Washington-area startups, often closely with data engineers. Boyd depended on the data engineers to get the data he needed to build and train his machine learning models, but the relationships were sometimes as dreary as the Northwest’s weather.
“One of the big things that come up and prevents us from working well together is…we think about problems in very different ways,” Boyd tells Datanami. “Our main metrics for success of a problem is done differently.”
As a data scientist, Boyd’s goal was to conduct as many experiments as possible to find the best machine learning model for a given problem. His data engineer colleagues, on the other hand, were focused on building data systems that were maintainable, reliable, and wouldn’t cause them to get a phone call at 2 in the morning because something broke down.
“That affect kind of bleeds through in terms of how we organize our work, how we think about our work, etc.” Boyd says. “Data scientists focus on the experimentation. Data engineers focus on data pipelines. Machine learning engineers focus on model orchestration and productionalization.”
Instead of relegating himself to his silo in the data organization, Boyd reached across the aisle and searched for ways to get what he needed as a data scientist without forcing his data engineering colleagues to compromise on their approach.
Step one is just getting in the same room together. When data scientists and data engineers are forced to talk with one another, they will (hopefully) find some common ground. If the two personas keep the organization’s end goal in mind, it will set the stage for future collaboration.
“They do have a lot of similar concerns. They’re both thinking about data and algorithms that process data,” Boyd says. “These are the only two disciplines that are thinking about them in conjunction with each other. They’re just thinking about them in different ways, so it’s great for them to unite and find common ground.”
Opening the line of communication can help to identify common stumbling blocks that afflict data organizations. Boyd once worked at a company that struggled to deliver models to production because the ball was always in the other team’s court. The data engineers said they would put a model into production when they had one that was good enough, while the data scientists refused to put time into working on a model that had nowhere to run.
“It wasn’t until we sat down together in a room…to say we’re going to set a goal together, together we’re going to deliver this and make this happen,” he says. “That’s bridging the organization gap. We’re on the same team. Let’s deliver the goal together.”
Once data engineers and data scientists are on the same page, there are some other things that each side can do to help the relationship become stronger.
For example, Boyd recommends that data engineers set up highly structured sandbox environments where data scientists can play to their heart’s content. If the environment is set up correctly, the engineers will be able to productionalize just about anything the data scientists build inside it without a massive engineering effort, thereby eliminating one of the sources of contention.
“The data scientist will have enough room to explore, to run experiments, to figure out what they need,” Boyd says. “And they can come back and say ‘If we just had this one extra app, we could do this.’ That’s been successful in the past.”
Data scientists, on the other hand, can make life better for their data engineering colleagues if they work to prioritize their wish list. Instead of giving their friendly neighborhood data engineer a list of every piece of data they might possibly want, they should hand over a prioritized list of the top 10, preferably with an explanation of how important each item will be for the organization.
It’s all about respecting data engineers’ time, Boyd says.
“Typically, data engineers end up having to prioritize across the organization,” he says. “They want to ruthlessly prioritize things. So if you can tell them ‘This is what I need. Here’s how much lift it gets us,’ it just makes it so much easier for them to say yes.”
Eventually, this approach breeds trust and empathy between the data scientists and data engineers. That trust and empathy are critical when edge cases pop up, where data scientists are working on a hunch that something will pan out but can’t yet prove it with data.
“There are definitely times where you say ‘I don’t know the full impact of this. I think it’s going to be impactful, but we need to run experiments,’” Boyd says. “If you’ve done that work to build that empathy and build that relationship, it’s going to make it that much easier for data engineers to meet you halfway and say, ‘I’m willing to go out on a limb and help you out with this. I know you’re not just coming at me with everything you can think of. You’re trying to be very disciplined about it.’”
These soft skills are not taught in data science bootcamps, and they’re typically not found in university programs. They’re not hard and fast either, as technology and tools continue to evolve at a rapid pace, forcing the folks who are tasked with driving a data agenda to shift what they do and how they do it. Feature stores, for example, are helping to automate the generation of features in machine learning.
But knowing how to better work with one’s colleagues never goes out of style.