Object Stores Starting to Look Like Databases
Don’t look now, but object stores – those vast repositories of data sitting behind an S3 API – are beginning to resemble databases. They’re obviously still separate categories today, but as the next-generation data architecture takes shape to solve emerging real-time data processing and machine learning challenges, the lines separating things like object stores, databases, and streaming data frameworks will begin to blur.
Object stores have become the primary repository for the vast amounts of less-structured data that’s generated today. Organizations clearly are using object-based data lakes in the cloud and on premises to store unstructured data, like images and video. But they’re also using them to store many of the other types of data, like sensor and log data from mobile and IoT devices, that the world is generating.
The object store is becoming a general purpose data repository, and along the way it’s getting closer to the most popular data workloads, including SQL-based analytics and machine learning. The folks at object storage software vendor Cloudian are moving their wares in that direction too, according to Cloudian CTO Gary Ogasawara.
“We’re moving more and more to that,” Ogasawara tells Datanami. “If you can combine the best of both worlds – have the huge capacity of an object store and the advanced query capability of an SQL-type database – that would be the ideal. That’s what people are really asking for.”
Past Is Prologue
We’ve seen this film before. When Apache Hadoop was the hot storage repository for big data (really, less-structured data), one of the first big community efforts was to develop a relational database for it. That way, data analysts with existing SQL skills – as well as BI applications expecting SQL data – would be able to leverage it without extensive retraining. And besides, after running less-structured data through MapReduce jobs, you needed a place to put the structured data. A database is that logical place.
This led to the creation of Apache Hive out of Facebook, and the community followed with a host of other SQL-on-Hadoop engines (or relational databases, if you like), including Apache Impala, Presto, and Spark SQL, among others. Of course, Hadoop’s momentum fizzled over the past few years, in part due to the rise of S3 from Amazon Web Services and other cloud-based object storage systems, notably Azure Blob Storage from Microsoft and Google Cloud Storage, which are universally more user-friendly than Hadoop, if not always cheaper.
In the cloud, users are presented with a wide range of specialty storage repositories and processing engines for SQL and machine learning. On the SQL front, you have Amazon Redshift, Azure Data Warehouse, and Google BigQuery. On top of these “native” offerings, the big data community has adapted many existing and popular analytics databases, including Teradata, Vertica, and others, to work with S3 and other object stores with an S3-compatible API.
The same goes for machine learning workloads. Once the data is in S3 (or Blob Storage or Google Cloud Storage), it’s a relatively simple matter to use that data to build and train machine learning models in SageMaker, Azure Machine Learning, or Google Cloud AutoML. With the rise of the cloud, every member of the big data and machine learning community has moved to support the cloud, and with it object storage systems.
As the cloud’s momentum grows, S3 has become the de facto data access standard for the next generation of applications, from SQL analytics and machine learning to more traditional apps too. For many new applications, data is simply expected to be stored in an object storage system, and developers expect to be able to access that data over the S3 API.
A Hybrid Architecture
But of course, not all new applications will live on the cloud with ready access to petabytes of data and gigaflops of computing power. In fact, with the rise of 5G networks and the explosion of smart devices on the Internet of Things (IoT), the physical world is the next frontier for computing, and that’s changing the dynamics for data architects who are trying to foresee new trends.
At Cloudian, Ogasawara and his team are working on adapting the company’s HyperStore object storage architecture to fit into the emerging edge-and-hub computing model. One of the examples he uses is the case of an autonomous car. With cameras, LIDAR, and other sensors, each self-driving car generates terabytes’ worth of data every day, and petabytes per year.
“That is all being generated at the edge,” he says. “Even with a 5G network, you will never be able to transmit all that data to somewhere else for analyses. You have to push that storage and processing as close to the edge as possible.”
Cloudian is currently working on developing a version of HyperStore that sits on the edge. In the self-driving car example, the local version of HyperStore would run right on the car and assist with storing and processing data coming off the sensors in real time. This computing environment would constitute a fast “inner loop,” Ogasawara says.
“But then you have a slower outer loop that’s also collecting data, and that includes the hub where the large, vast data lake resides in object storage,” he continues. “Here you can do more extensive training of ML models, for example, and then push that kind of metadata out to the edge, where it’s essentially a compiled version of your model that can be used very quickly.”
In the old days, object stores resembled relatively simple (and nearly infinitely scalable) key-value stores. But to support future use cases — like self-driving cars as well as weather modeling and genomics — the object store needs to learn new tricks, like how to stream data in and intelligently filter it so that only a subset of the most important data is forwarded from the edge to the hub.
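To make the edge-to-hub filtering idea concrete, here is a minimal Python sketch. It is not Cloudian's implementation: the Reading type, the baseline field, and the forward_to_hub function are hypothetical names invented for illustration, and "importance" is assumed to mean nothing more than deviation from an expected value.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Reading:
    """One sensor sample captured at the edge (hypothetical structure)."""
    sensor: str
    value: float
    baseline: float  # assumed expected value for this sensor


def forward_to_hub(readings: Iterable[Reading], tolerance: float) -> Iterator[Reading]:
    """Yield only the readings that deviate from their baseline by more than tolerance."""
    for reading in readings:
        if abs(reading.value - reading.baseline) > tolerance:
            yield reading


# Three samples arrive at the edge; only the unusual ones are worth sending on.
stream = [
    Reading("lidar", 10.1, 10.0),
    Reading("camera", 3.0, 9.0),
    Reading("lidar", 25.0, 10.0),
]
forwarded = list(forward_to_hub(stream, tolerance=5.0))
```

In a real deployment the filter would more likely be a trained model than a fixed threshold, but the shape is the same: everything is kept in the fast inner loop, and only a small subset travels to the hub.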
To that end, Cloudian is working on a new project that will incorporate analytics capabilities. With a working name of the HyperStore Analytics Platform, the project would incorporate frameworks like Spark or TensorFlow to assist with the intelligent streaming and processing of data. A beta was expected by the end of the year (at least that was the timeline that Ogasawara shared in early March before the COVID-19 lockdown).
Cloudian is not the only object storage vendor looking at how to evolve its product to adapt to emerging data challenges. In fact, it’s not just object storage vendors who are trying to tackle the problem.
The folks at Confluent have adapted their Kafka-based stream processing technologies (which excel at processing event data) to work more like a database, which is good at managing stateful data. MinIO has SQL extensions that allow its object store to function like a database. NewSQL database vendor MemSQL has long had hooks for Kafka that allow it to process large amounts of real-time data. The in-memory data grid (IMDG) vendors are doing similar things for processing new event data within the context of historic, stateful data. And let’s not even get into how the event meshes are solving this problem.
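What "an object store that functions like a database" might look like can be sketched in a few lines of Python. This is a toy under stated assumptions, not the API of MinIO, Cloudian, or S3: the object store is just a dictionary of keys to bytes, the query_object helper and the table name obj are made up for the example, and SQLite stands in for whatever query engine a vendor would actually embed.

```python
import csv
import io
import sqlite3

# A toy "object store": keys mapped to opaque bytes (here, one CSV object).
object_store = {
    "logs/2020-03.csv": b"device,temp\na,21.5\nb,30.2\nc,35.0\n",
}


def query_object(key: str, sql: str) -> list:
    """Load one CSV object into an in-memory SQLite table named 'obj' and run sql on it."""
    text = object_store[key].decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        conn.execute(f"CREATE TABLE obj ({columns})")
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f"INSERT INTO obj VALUES ({placeholders})", data)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Instead of fetching the whole object, the caller asks a question about its contents.
hot = query_object(
    "logs/2020-03.csv",
    "SELECT device FROM obj WHERE CAST(temp AS REAL) > 30 ORDER BY device",
)
```

The point of the sketch is the interface: the caller hands over a key and a query, and gets back only the matching rows rather than the entire object.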
According to Ogasawara, adapting Cloudian’s HyperStore offering is a logical way to tackle today’s emerging data challenges. “You’ve done very well at building this storage infrastructure,” he says. “Now, how do you make the data usable and consumable? It’s really about providing better access APIs to get to that data, and almost making the object storage more intelligent.”
Object stores are moving beyond their initial use case, which was reading, writing, and deleting data at massive scale. Now customers are pushing object storage vendors to support more advanced workflows, including complex machine learning workflows. That will most likely require an extension to the S3 API (something that Cloudian has brought up with AWS, but without much success).
“How do you look into those objects? Those types of APIs need more and more [capabilities],” Ogasawara says. “And even letting AI or machine learning-type workflows, doing things like a sequence of operations — those types of language constructs, everyone is starting to look at and trying to figure out how do we make it easier for users and customers to make that data analysis possible.”