November 13, 2013

Dat Wants to be the GitHub for Data

Isaac Lopez

Git and GitHub changed how programmers work together by enabling open and scalable collaboration, posited Max Ogden, an independent developer. At the Strata Conference in London this week, Ogden revealed “Dat,” a Git-like prototype for real-time streaming data.

Ogden says he started the project to address challenges developers face when they try to use Git as a data vehicle. Git can be, and has been, used as a data sharing tool, he says. However, because it was designed for source code, it has drawbacks when used for committing real-time data – it’s difficult to share, it requires custom code to extract the data and land it in a database, and it doesn’t scale well beyond about 1 million commits.

To tackle this challenge, Ogden says he’s developed Dat, a streaming data sharing prototype funded by the Knight Foundation, which aims to underwrite “transformational ideas” and is the same group responsible for funding the civic hacking group Code for America.

“What it is, is a streaming, real-time table replicator,” says Ogden, who explained a little bit about how he constructed the Dat framework. Technologies used include the following:

  • Node.js – “Node is one of the fastest growing ecosystems,” says Ogden. “It’s a distributed I/O platform for managing I/O on different platforms and sending data around really fast.”
  • levelDB – Built by Jeff Dean and Sanjay Ghemawat at Google (who also wrote BigTable and Spanner), levelDB is actually the data for a single BigTable tablet, Ogden explains. “It’s a building block for a distributed system,” he explained.
  • NPM – Saying that NPM is the sister to Node.js, Ogden explains that NPM is the package manager for the streaming system.

Using this framework, Ogden says he has built a prototype in which data can be streamed in and transformed, updating the data stream in real time. “Patches are to code what transforms are to data,” he explained, offering an analogy. “With data you can use transforms as your stand in for patches… Transforms can operate on streams of data.”
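To make the idea of transforms operating on streams concrete, here is a minimal sketch in Python. It is not Dat's API (Dat itself is built on Node.js); the names `rename_field`, `to_float`, and `pipeline`, and the sample rows, are hypothetical and exist only for illustration. Each transform takes a stream of rows and yields a changed stream, so transforms can be chained the way patches are applied in sequence.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]
Transform = Callable[[Iterable[Row]], Iterator[Row]]


def rename_field(old: str, new: str) -> Transform:
    """Hypothetical transform: rename one column in every row."""
    def apply(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            row = dict(row)  # copy so the incoming row is not mutated
            if old in row:
                row[new] = row.pop(old)
            yield row
    return apply


def to_float(field: str) -> Transform:
    """Hypothetical transform: parse one column as a float."""
    def apply(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            yield {**row, field: float(row[field])}
    return apply


def pipeline(rows: Iterable[Row], *transforms: Transform) -> Iterable[Row]:
    """Chain transforms lazily; rows flow through one at a time."""
    for transform in transforms:
        rows = transform(rows)
    return rows


# Illustrative input only; not data from the article.
source = iter([{"id": 1, "temp": "21.5"}, {"id": 2, "temp": "19.0"}])
result = list(pipeline(source, rename_field("temp", "temp_c"), to_float("temp_c")))
```

Because the transforms are generators, nothing is loaded into memory all at once: each row is renamed and converted as it arrives, which is the property that makes this style suitable for streaming data.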

Ogden explained that in the era of Open Data, there is a gap between the implementers of data (the NASAs, federal and local governments, and research organizations who are providing the data), and the end users and app makers of the data. “They have their data in a raw format, but the people who want to use it (say data journalists, researchers, civic hackers, etc.), they might not want to learn the NASA formats at a low level – they just want CSV, or JSON, or something nice. They just want a SQL database to pop up so they can just play with it,” he says.

While there are some specialized tools in the middle that make the data easier to use (such as R, NumPy, and data format parsers), Ogden argues that better tools are needed to bridge the gap. Ogden hopes that Dat can be that bridge. “It’s built for large data sets,” he says. “Everyone can be synchronizing almost like Dropbox style, but for spreadsheets… when you update data in one place it syncs out to everyone else.”
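The article does not describe how Dat replicates tables, so the following is only a sketch of one common approach to "update in one place, sync out to everyone else": an append-only change log with sequence numbers, from which each copy pulls whatever it has not yet seen. The `Table` and `Replica` classes and their methods are assumptions made for illustration, not Dat's design.

```python
from typing import Dict, List, Tuple

Change = Tuple[int, str, dict]  # (sequence number, row key, row value)


class Table:
    """Hypothetical source table that records every write in a change log."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.log: List[Change] = []

    def put(self, key: str, value: dict) -> None:
        self.rows[key] = value
        self.log.append((len(self.log) + 1, key, value))

    def changes_since(self, seq: int) -> List[Change]:
        return [change for change in self.log if change[0] > seq]


class Replica:
    """Hypothetical copy that catches up by replaying unseen changes."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.last_seq = 0

    def pull(self, source: Table) -> int:
        changes = source.changes_since(self.last_seq)
        for seq, key, value in changes:
            self.rows[key] = value
            self.last_seq = seq
        return len(changes)


source = Table()
source.put("a", {"city": "London"})
source.put("b", {"city": "Oakland"})

replica = Replica()
first_pull = replica.pull(source)   # replays both writes

source.put("a", {"city": "Paris"})  # update in one place...
second_pull = replica.pull(source)  # ...and only the new change is transferred
```

The design choice worth noting is that a replica only asks for changes after its last seen sequence number, so keeping a large table current costs in proportion to what changed rather than to the size of the table.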

“I’m so glad I live in the Git and GitHub age, because it means I’m not just limited to the company that I work at, and the people that I work with, but I can work with people around the world in a totally distributed fashion,” Ogden told his audience during his session at the Strata conference. He says that he’s most excited about the potential to build community around Dat to leverage the data.

“My goal with Dat is [to have] importers and exporters that are separate, modular pieces of functionality that can be shared,” he explained, describing what amounts to plug-and-play data streams. “I want to enable a vibrant ecosystem of reusable modules,” said Ogden, revealing the ultimate goal of the project.
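As a rough illustration of importers and exporters as separate, swappable pieces, the sketch below uses only Python's standard `csv`, `io`, and `json` modules. The function names (`import_csv`, `import_json_lines`, `export_json_lines`, `export_csv`) and the sample data are hypothetical and do not come from Dat; the point is only that any importer can be paired with any exporter as long as both agree on a common row format.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator

Row = Dict[str, str]


def import_csv(text: str) -> Iterator[Row]:
    """Importer: CSV text in, a stream of row dicts out."""
    yield from csv.DictReader(io.StringIO(text))


def import_json_lines(text: str) -> Iterator[Row]:
    """Importer: one JSON object per line in, a stream of row dicts out."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def export_json_lines(rows: Iterable[Row]) -> str:
    """Exporter: a stream of row dicts in, one JSON object per line out."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


def export_csv(rows: Iterable[Row]) -> str:
    """Exporter: a stream of row dicts in, CSV text out."""
    rows = list(rows)
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative input only; not data from the article.
csv_text = "station,temp_c\nA,21.5\nB,19.0\n"
as_json = export_json_lines(import_csv(csv_text))
round_trip = export_csv(import_json_lines(as_json))
```

Here the CSV importer feeds the JSON exporter, and the JSON importer feeds the CSV exporter, without either side knowing about the other.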

Dat is still in alpha, but Ogden says it will be hitting a beta release soon and will be available at dat-data.com. For those interested, the project is being developed in the open on GitHub.
