October 12, 2023

TileDB Wants to Flip the Script on Multi-Modal Databases

(Image source: TileDB)

Being just one type of database is too limiting these days, so many databases have gone multi-modal. Why be just a key-value store when you can be KV plus graph? Document databases are great, but adding a time-series data type makes them better! And everybody stores vectors now. But the folks behind TileDB aren’t buying into the multi-modal trend and instead are seeking to invert the paradigm with a fundamentally different approach.

It’s safe to say that TileDB founder and CEO Stavros Papadopoulos isn’t the biggest fan of multi-modal databases.

“They’re committing to a single data type–a table or a document or key-value–but still it’s a table. It doesn’t change,” he says. “And then they’re force-fitting the other data types within this.”

The compromise a multi-modal database makes costs performance, and that loss cannot be tolerated in some of the biggest analytic use cases TileDB customers run, including genomics, LiDAR imaging, and geospatial IoT data.

For example, multi-modal database vendors say they can store large image files, or binary large objects (BLOBs), within their database. But according to Papadopoulos, they are in fact storing the BLOB files in an external object store or even Dropbox, and then linking to the BLOB’s IP address from the database. “That’s not multi-modal, man,” he says. “It’s like storing a file inside your table.”

With his distributed, in-memory TileDB database, Papadopoulos is taking a fundamentally different approach. Instead of trying to fit different data types into a database that’s fundamentally designed to handle one data type, he built TileDB with a more versatile underlying data type: the multi-dimensional array.

“TileDB adopts the multidimensional array as a first-class citizen, and this array has one attribute–it shapeshifts,” Papadopoulos says. “It morphs. It can become a table. It can become an image. It can become a document. It can become a key-value, because it changes. It has different dimensions. It is dense or sparse. It has all this functionality built into the model that allows it to take different shapes.

“So instead of force-fitting data to a stiff data type,” he continues, “we do the other way around. We are shifting the paradigm. We take the data that’s structured and we change it and we apply perfectly to each of the data types, and that gives you performance.”
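To make the "shapeshifting" idea concrete, here is a minimal sketch in plain Python of how an image, a set of point readings, a table, and a key-value store can each be viewed as a dense or sparse array. It does not use the TileDB API; the variable names and the `slice_dense` and `slice_sparse` helpers are hypothetical and exist only for illustration.

```python
# Illustrative only: plain-Python stand-ins, not the TileDB API.

# Dense 2-D array: every cell in the domain holds a value (e.g. a grayscale image).
image = [
    [0, 128, 255],
    [64, 192, 32],
]

# Sparse 2-D array: only non-empty cells are stored, keyed by their coordinates
# (e.g. point readings at (x, y) positions).
points = {
    (3, 7): 12.5,
    (40, 2): 9.1,
}

# A table viewed as a dense 1-D array: the row index is the dimension and
# each column is an attribute kept in its own list (columnar).
table = {
    "name": ["ada", "grace"],
    "age": [36, 45],
}

# A key-value store viewed as a sparse 1-D array with a string dimension.
kv = {("user:1",): "ada", ("user:2",): "grace"}


def slice_dense(array, rows, cols):
    """Return the sub-array covering half-open row and column ranges."""
    return [row[cols[0]:cols[1]] for row in array[rows[0]:rows[1]]]


def slice_sparse(cells, lo, hi):
    """Return non-empty cells whose coordinates fall in [lo, hi) on every dimension."""
    return {
        coords: value
        for coords, value in cells.items()
        if all(l <= c < h for c, l, h in zip(coords, lo, hi))
    }
```

In this sketch the same range-slicing operation serves every shape: `slice_dense(image, (0, 2), (1, 3))` crops the image, while `slice_sparse(kv, ("user:1",), ("user:2",))` acts as a key lookup.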

Papadopoulos founded TileDB after earning his PhD in computer science and engineering from the Hong Kong University of Science and Technology and spending time as a senior research scientist at Intel Labs and as a visiting scientist at both the Broad Institute and MIT.

At MIT, Papadopoulos worked under Turing Award winner Mike Stonebraker, and took an interest in one of his creations: SciDB, a column-oriented database designed to store multi-dimensional data in mathematical arrays for scientific applications. Papadopoulos took those learnings and created TileDB, which was spun out of MIT and Intel in 2017.

The capability to store data in multi-dimensional arrays gives TileDB the flexibility and performance needed for today’s toughest analytic challenges, Papadopoulos says. Three core characteristics differentiate TileDB from other databases, he says.

The TileDB database can support dense and sparse arrays (Image source: TileDB)

“Number one, it handles all data,” he says. “Not just tables. Not just files. Not just key-value. It handles all modalities. And this is extremely important because you don’t have to buy 10 different database systems if you were dealing with 10 different data modalities.

“The second thing is that we’re integrating the code in the database,” he says. “We don’t believe that the code should live elsewhere. And if it lives in GitLab still, it needs to be managed alongside the data because that’s how the developers use the code and the data in the organization.

“And number three, our compute goes way beyond SQL,” he continues. “SQL is just one API for us. We have multiple other APIs. We have a generic distribution, which we build ourselves, and a distributed serverless engine, where you can spin up pretty much anything. You can spin up user defined functions. You can spin up task graphs for complex workloads. You can spin up Jupyter notebooks. You can spin up any kind of Web application within the same environment.”

Stavros Papadopoulos is the CEO and founder of TileDB

The open-source component of the TileDB database features APIs for Python, R, Java, Julia, Go, C, C#, and C++, enabling application developers to use the database with a range of different applications. The database integrates with Apache Arrow, providing compatibility with SQL engines like MariaDB, Trino, and Presto; computational frameworks like Dask and Spark; data science tools like Pandas, NumPy, and Vaex; as well as machine learning frameworks like PyTorch, TensorFlow, and scikit-learn.

TileDB, which was written in C++, integrates with object stores, including S3, Azure Blob Storage, Google Cloud Storage, and MinIO, primarily for persistence purposes, Papadopoulos says. Users pull in the data they want to analyze and store it in TileDB’s columnar format, which he calls “Parquet on steroids” because it allows users to pick two or three parameters to optimize the layout of the data on disk.
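The article does not name those layout parameters. As a hedged illustration of why such a knob matters, the sketch below assumes one of them is the order in which cells are written (row-major versus column-major) and shows how that choice decides whether a column read touches adjacent or scattered positions. The `linearize` and `positions` helpers are hypothetical and are not part of TileDB.

```python
# Illustrative only: assumes cell order is one tunable layout parameter.

def linearize(grid, order="row"):
    """Flatten a 2-D list into the 1-D sequence in which cells would be written."""
    if order == "row":
        return [cell for row in grid for cell in row]
    if order == "col":
        return [row[c] for c in range(len(grid[0])) for row in grid]
    raise ValueError("order must be 'row' or 'col'")


def positions(flat, wanted):
    """Offsets in the flattened sequence where the wanted cells land."""
    return [flat.index(v) for v in wanted]


grid = [
    [1, 2, 3],
    [4, 5, 6],
]
first_column = [1, 4]

row_major = linearize(grid, "row")  # [1, 2, 3, 4, 5, 6]
col_major = linearize(grid, "col")  # [1, 4, 2, 5, 3, 6]

print(positions(row_major, first_column))  # [0, 3]: column cells are scattered
print(positions(col_major, first_column))  # [0, 1]: column cells are adjacent
```

When the write order matches the dominant read pattern, a query touches fewer and more contiguous byte ranges, which matters most on object stores where each range request carries latency.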

When Papadopoulos started the company, it was just him. Now the Cambridge, Massachusetts-based company has about 50 employees, including a full team of engineers working to build and support the database for enterprise use cases. Its customers use the database for high-end scientific analysis of very large data sets in industries such as pharmaceuticals, oil and gas, and autonomous vehicles.

This week the company announced that it’s raised $34 million in a Series B round led by AlleyCorp, the venture capital firm led by Kevin Ryan, who is the co-creator of MongoDB. Participating in the round were Two Bear Capital, Nexus Venture Partners, Big Pi Ventures, Intel Capital, Uncorrelated, Lockheed Martin Ventures, Amgen Ventures, NTT Docomo Ventures, Verizon Ventures, S Ventures, LDV Partners, and Scale Asia Ventures.

The Series B shows that the company is for real and that its database is ready for production deployments, Papadopoulos says. “It’s not just a hypothesis anymore,” he says. “We just raised our Series B round, so now we can prove that it works extremely well.”
