From Wall Street to Main Street: Inside Deephaven’s Big Data Journey
In this post-Hadoop world, we’ve seen a number of data architectures emerge and gain traction. One of the more interesting ones is Deephaven, which was originally developed last decade to power a quantitatively driven hedge fund, and which is now being offered to the world as an open platform for real-time analytics and machine learning.
Pete Goddard founded Walleye Capital in 2004 with the idea of using lots of data and fast computers to make a lot of money for his clients. Goddard oversaw the development of a system called Deephaven that enabled Walleye analysts to query large amounts of fast-moving data in real time, thereby giving his clients a competitive advantage on the stock market. He made a lot of money for his clients.
In 2016, Goddard spun Deephaven Data Labs out as its own company, with the idea of using the Deephaven system to solve data challenges in other ways. Over the past five years, the company has attracted a number of customers in a variety of industries, including healthcare, manufacturing, and even automobile racing. Now the company is looking to expand its presence and product usage by embracing the open source community.
“It was certainly interesting and fun to be at the helm of a trading company and using this, as well as some other technologies we had, to make money. I did it for a long time,” Goddard tells Datanami. “We think we’re in a unique spot now. We understand a different way of doing things. We’ve seen it work. We know how powerful it can be. And now we want to bring it to the community in an open way.”
A New Data Framework
So, what is Deephaven? That is not such an easy question to answer. The company’s website says that, at its core, Deephaven is a column-oriented database. A spokesperson for the company described it as a time-series database. Asked to expand on that, Goddard hedged a bit.
“Fundamentally, it’s two different things,” he says. “It’s a data engine, and then it’s a data framework.”
As a data engine, Deephaven works similarly to other compute engines, such as Apache Spark or a SQL query engine, Goddard says. Users can query the data, which is typically stored in Parquet format, and even run machine learning models developed in Python or TensorFlow against that data. But unlike many big data products, there is no Spark to be found inside of Deephaven. And there also is no SQL interface.
“It is a new way of working with data to produce analytics, to develop applications,” Goddard says. “It does not sit on top of other data engines. It is its own version thereof.”
As a framework, Deephaven, which was developed in Java, provides a lot of the other “stuff” that users need to be productive with the software. That includes data connectors, APIs, interoperability with other tools, and user interfaces that allow users to work directly with the data ingested into the system. When it comes to machine learning, the software can execute models developed in Python, TensorFlow, and Numba.
But that isn’t the full description of what Deephaven does, either. According to Goddard, what Deephaven really excels at is enabling analytics and machine learning on real-time data.
“We are unlike any other data system that exists out there in our ability to both handle real-time data, dynamic data, and to allow a user to seamlessly move between historical static data and dynamic real-time data,” Goddard says. “We are, under the covers, observing adds, deletions, updates, modifications, and we’re keeping state in interesting ways so that we can incrementally compute stuff instead of doing whole computes again on some sort of cycle.”
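The incremental approach Goddard describes can be sketched in a few lines. This is not Deephaven’s actual internals, just a minimal illustration of the pattern: an aggregate keeps a small piece of state and reacts to add, remove, and modify events in O(1), rather than rescanning the whole table on every update cycle.

```python
# Illustrative sketch (not Deephaven's implementation): an aggregate that
# updates its state incrementally as rows are added, removed, or modified.
class IncrementalMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_add(self, value):        # a new row arrives
        self.count += 1
        self.total += value

    def on_remove(self, value):     # a row is deleted or retracted
        self.count -= 1
        self.total -= value

    def on_modify(self, old, new):  # an existing row changes in place
        self.total += new - old

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

agg = IncrementalMean()
for v in (10.0, 20.0, 30.0):
    agg.on_add(v)
agg.on_modify(30.0, 60.0)           # constant-time update, no full recompute
print(agg.mean)                     # 30.0
```

The same idea generalizes to joins and more complex aggregations, which is where keeping state “in interesting ways” becomes the hard part.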
Time-Stamping Data
Keeping track of when an event occurred is critical in executing trading strategies, and it’s becoming increasingly important in other industries, particularly for organizations that want to squeeze insights from high-volume event data. For Goddard, the key deliverable is enabling his clients to recall the state of the world at any given point in time.
“There could be two data sources you care about, or there could be thousands of data sources,” he says. “I just did a trade in Apple. Well, what happened on Twitter one second right before I did the trade in Apple? Is there spiking Twitter volume around Apple, and therefore maybe that’s a hint to me that the world knew something that I didn’t and I just got run over?
“There’s all this distinct data in the world that can flow in a number of ways,” he continues, “and I need to be able to bring it together really well based on timestamps, meaning they’re here now, or I want to do this study from 10 minutes ago. That can be pretty important.”
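The kind of timestamp-based join Goddard describes is often called an “as-of” join: for each event in one stream, find the most recent observation in another stream at or before that time. The stdlib sketch below is purely illustrative (the data and function names are made up); Deephaven and other time-series systems expose this as a built-in operation.

```python
import bisect

# Hypothetical as-of join: for each trade, look up the latest tweet-volume
# reading at or before the trade's timestamp.
tweet_times  = [1.0, 2.0, 4.0, 7.0]   # observation times (seconds), sorted
tweet_volume = [12, 95, 40, 310]      # tweets/sec mentioning the symbol

def volume_as_of(ts):
    """Return the tweet volume as of timestamp ts (None if before all data)."""
    i = bisect.bisect_right(tweet_times, ts) - 1
    return tweet_volume[i] if i >= 0 else None

trade_times = [2.5, 6.9, 0.5]
print([volume_as_of(t) for t in trade_times])  # [95, 40, None]
```

Doing this well at scale, across thousands of sources and with live data still arriving, is the hard version of the problem.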
At a technical level, Deephaven has the capability to accept real-time data flows from pub/sub systems, such as Kafka or Solace, and join that with static data sitting in a Parquet file, and “in a very lightweight way, unlike KSQL, deliver derived streams on top of streams to consumers, either via APIs or via user experiences,” Goddard says. “That will exist out of the box.”
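The “derived streams on top of streams” pattern can be sketched as a generator pipeline. This is a conceptual illustration only: the names are invented, the static side here is a plain dict standing in for a Parquet file, and in practice the event stream would arrive from Kafka rather than a list.

```python
# Minimal sketch of the pattern: enrich a live event stream with static
# reference data, producing a derived stream that downstream consumers read.
reference = {"AAPL": "Apple Inc.", "MSFT": "Microsoft Corp."}  # static side

def derived_stream(events):
    """Yield enriched records: a stream computed on top of another stream."""
    for evt in events:
        name = reference.get(evt["sym"], "unknown")
        yield {**evt, "name": name, "notional": evt["qty"] * evt["px"]}

ticks = [{"sym": "AAPL", "qty": 100, "px": 150.0},
         {"sym": "MSFT", "qty": 50,  "px": 300.0}]
out = list(derived_stream(ticks))
print([r["notional"] for r in out])  # [15000.0, 15000.0]
```

The lightweight part of the claim is that consumers subscribe to the derived result directly, without standing up a separate stream-processing cluster.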
Deephaven, which runs in a distributed manner, also plays nicely with data stored in the Apache Arrow and Arrow Flight data formats, and Goddard is looking to expand Deephaven’s presence in that little corner of the open source community. In fact, Deephaven has contributed a new feature to the Arrow project that enables the data format to better understand changing data.
The company is making Deephaven available under a “source available” license. The idea is to attract more users to Deephaven, with the hopes that developers will take the ball and help to further integrate it with the open source community.
“There’s quite a bit of interesting intellectual property under the covers, and the important bits of that are now out in the open for people to be able to see in our code base,” Goddard says. “But I don’t think many developers or community members will care how it works. They’ll just be able to use it and be excited that it works.”
Data Meets Software
Goddard seems to relish his status as an outsider. After spending over a decade in the pressure cooker of Wall Street, the Illinois native doesn’t seem interested in fitting into Silicon Valley’s preconceived notions of software categories.
Whether Deephaven should be described as a column-oriented, time-series database, a streaming analytics framework, or a hybrid real-time/batch processing system, those labels don’t mean much to Goddard.
“The big difference between us and everyone else is we came from the outside and therefore we think of this stuff as a continuum,” he says. “I just think of data driven stuff as data meets software. Everyone else puts it into a box. I’m like, I don’t care if that’s one of the boxes. Data meets software could be real time or it could be batch. Data meets software could be an application. It could be analytics. It could be a visualization for a business analyst or it could be data science or whatever.”
The company has worked with a range of customers, including those in capital markets, healthcare telematics, and even a Formula 1 race car team. The common feature linking all of these customers is a desire to derive insights from large amounts of fast-moving data.
“This isn’t a science project,” Goddard says. “This is working technology that some of the biggest heavyweights in the capital markets are using for critical path stuff… These are things that our current customers are doing, and they’re very sophisticated people who could be choosing other stuff to use.”