In Search of the Data Dream Team
When it comes to succeeding at big data, the people you put in place are just as important as, if not more important than, the products and technologies you use. One of the folks exploring the intersection of people and data is Jesse Anderson, who just kicked off season two of the Data Dream Team podcast.
As a former Cloudera employee, Anderson has had a front-row seat to the big data wars of the past decade. The data engineer witnessed firsthand the obsession many had with implementing the latest technology, whether it was Hadoop or moving everything to the cloud, while ignoring the importance of having the right people in place to ensure success.
As the managing director of the Big Data Institute, Anderson puts his experience to work helping customers build data teams designed to succeed. He’s also written three books, including Data Teams and Data Engineering Teams.
About a year ago, Anderson was approached by data observability tool provider Soda about collaborating on a podcast. The result is the Data Dream Team, which recently concluded season one with 20 episodes featuring guests like Paco Nathan, Jordan Morrow, Zhamak Dehghani, and Holden Karau.
Soda CEO Maarten Masschelein, who was a guest on Anderson’s podcast, is also ramping up his own podcast, which will debut this year. Anderson and Masschelein recently took a break from recording podcasts to talk about the data staffing conundrum with Datanami.
The way Anderson sees it, technology is important, but too little time and effort is spent on putting the right people in positions to succeed with big data. The Data Dream Team podcast is a chance to talk to various folks in the industry to gain greater perspective on successful approaches to the people side of the equation, he says.
“In order to be successful with this, you need to have your people right,” Anderson says. “There’s a lot about the technologies and what you need to do, but there is also the people part and there’s the process part that you need to add in there. And without those pieces, you’re not going to be successful.”
In big data, there’s a tendency to look to new technologies for solutions. It’s not altogether unreasonable, and technology certainly has a place, Anderson says. But buying new technology without focusing some time and attention on the people and process side of the equation is a recipe for failure, he says.
“You know the meme, ‘There I fixed it.’ That’s kind of what I think it is,” Anderson says. “’Oh, you have that problem? Oh, just put some Flink in there, put some Kafka in there, and that’ll fix it.’ Now you have two problems, or maybe three or maybe you have 10 now because you took and you plopped the technology in there.
“I think it’s been an issue that leaders have been told ‘You just need our technology in there,’ instead of saying ‘Yeah, your technology has a place, but make sure you’ve got the team right there too.’”
Engineers are inherently curious, and like to play with new technology, Masschelein says. And that’s not a bad thing. But that tendency needs to be tempered with a dose of reality when it comes to bringing new technology to production.
“People are always curious, especially technologists. You want to try out the latest, want to understand it…try it in a business context, try to solve the problem with it,” he says. “But I think where it fails is that…we invest in it, and then we expect it to work.”
The gulf that exists between data science and data engineering is nothing new. But in Masschelein’s view, the industry as a whole is due for a reconciliation based on the imbalances that have been created by a desire to play with new technologies.
“We definitely saw an overinvestment in data science,” he says. “I made the mistake myself. So guilty as charged. We are not investing enough in data engineering. So you can prove out a concept with a data scientist, but you cannot bring it to production.”
Hadoop is widely viewed as a failure. But Anderson doesn’t necessarily view it that way. While the technology was definitely overhyped, all too often the individual failures of big data projects could be traced to, you guessed it, an imbalance between expectations and reality, and not having the right folks on the ground to support it, Anderson says.
“I’m former Cloudera, so I got to study that in person at all sorts of different companies,” he says. “Hadoop worked just fine. It had its problems. It had its pointy parts, but it wasn’t an issue of Hadoop usually. It was an issue of: they failed at this project, and now they’re choosing Hadoop as the next silver bullet.”
Now that Hadoop has lost its luster, folks are moving on to the next shiny objects, which happen to be Kubernetes and object stores in the cloud. Without changes in underlying assumptions and a new approach to project management (which requires strong leadership), don’t expect the results to differ much from the Hadoop experiment, Anderson says.
“Cloud wasn’t a silver bullet either,” he says. “You had to go back and you had to look at why are you failing at those things.”
When Anderson left Cloudera, Kubernetes was this strange technology that maybe would have some real-world impact far in the future. Mesos seemed to be winning that war. But the container orchestration layer has actually matured much more rapidly than he would have expected. As far as technologies go, Kubernetes has the potential to be a powerful lever to do big things with data, he says. But it comes with a caveat.
“Kubernetes is further ahead than I thought we would have been in 10 years,” he says. “How long did it take people to spin up a Hadoop cluster or Spark cluster, what have you? I can tell you, it took people hours. Go on the cloud now, spin it up–that’s all running Kubernetes behind the scenes.”
The use of Kubernetes changes the types of individuals companies need on their data teams. It also allows companies to get more out of those individuals, provided they use the technology for what it’s good at and effectively pivot data engineers to higher-value work, Anderson says.
“Your operations team may be smaller, but it never goes away completely,” he says. “That’s a really key point for people.”
Keeping your data team up to date with the latest technology is obviously important. But it’s also important to shape your data team in conjunction with the maturation of technology. Here, Kubernetes provides a lesson.
“When I work with companies, when I consult with companies, I say get out of the operations game as much as possible,” Anderson says. “And you do that because you get your people working on the things that make you money rather than solve problems. Certain parts of operations are solved problems. Pay somebody for those solved problems. Focus on your business problems. That’s going to make you money.”
A businessperson may flinch when asked to pay $1 or $100 or $1,000 a month for a cloud service, Anderson says. But if the service can automate something that used to take an engineer dozens of hours to do manually, it may be a better deal to go with the cloud service.
“This roll-your-own mentality also gets you into problems that you don’t need to go and solve yourself,” he says. “Maybe it’s a scratch [you need to itch]. But that isn’t what we need to do now. So it’s really getting the team to think about it in those terms.”
There’s an inherent complexity in distributed systems that will never go away. In his new book, Data Teams, Anderson explores a theory that there is no such thing as an easy-to-use general-purpose distributed system. Spark is a technology that can be molded to do many things, but it requires time and effort to get it there. That should inform how data leaders seek to assemble their teams.
“No-code won’t work because we are used to always customizing,” Anderson says. “People have to go to [no-code] rather than it to them… I think that, at its core, is why we can’t get no-code. The business will never say, ‘Oh, we’ll just accede to what it can do and how it does it.’”
The data mesh is another topic that Anderson has tackled on the Data Dream Team podcast, and one that will likely come up again. Anderson seems to agree with much of what Dehghani says about data meshes, particularly when it comes to people. But he has some concerns about the acceptance of data silos, which it seems he will get to discuss with her again soon.
“My quote’s in Zhamak’s book. I think I said something like, ‘We’ll all be asking ourselves why we weren’t doing this sooner,’” Anderson says. “I have a few differences. But I think by and large, this is what’s going to allow us to do things well, and even better, but we have to do it right. I think that’s the key. We have to actually do the things that she’s talking about. Not just focus on the technology. She talks about the technology. She always says it’s a socio-technical [challenge]. You get the socio side, that’s really what we’re talking about.”
The core principle of the data mesh, that distributed data teams are responsible for their own data while adhering to a few unifying principles, is generally sound. But acquiescing to data silos seems to cut against Anderson’s grain.
“That’s one of the niggles that I have with the book: I’ve seen the effects of lots of data all over the place and no centralization,” he says. “It’s not a good scene, so that’s part of what I’m excited to discuss with her. I’ve seen the effects of some of this organic growth if it doesn’t happen right.”
You can access the Data Dream Team podcast at dreamteam.soda.io.