Couchbase to Deliver Parallel JSON Analytics — Without the ETL
Couchbase yesterday unveiled a JSON-native analytics engine it claims will let users perform parallel ad-hoc analytics on operational data just milliseconds after it’s landed in the NoSQL store. The Analytics Service, which is based in part on the Apache AsterixDB and SQL++ projects from academia, has been dubbed “NoETL,” but don’t let the catchy phrase fool you.
The new Analytics Service is the cornerstone of Couchbase Server 6.0, which the company launched yesterday at its Couchbase Connect SV conference at the San Jose-McEnery Convention Center, and which is slated for delivery in October. The company dedicated a significant portion of the one-day show to explaining how it built the new analytics engine and what impact it will have on customers, and to running a real-time demo in which it spun up an analytics cluster in less than five clicks.
For many businesses, analytics is heavily dependent on the extract, transform, and load (ETL) process, whereby data is lifted out of a production system and shunted over to a data warehouse. But ETL processes can take months to define, are susceptible to shifting schemas, and typically require flattening of the rich hierarchical data structures that exist within JSON documents.
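The flattening problem is easy to see with a small example. The snippet below is a hypothetical illustration, not Couchbase code: it shows how a nested JSON document must be exploded into flat relational rows before a traditional warehouse can ingest it, losing the hierarchy in the process.

```python
# Hypothetical illustration of ETL flattening -- not Couchbase code.
# A nested JSON order document must be exploded into flat rows
# (one per line item) before a relational warehouse can store it.

def flatten_order(doc):
    """Turn one nested order document into warehouse-style flat rows."""
    rows = []
    for item in doc["items"]:
        rows.append({
            "order_id": doc["order_id"],
            "customer": doc["customer"]["name"],
            "city": doc["customer"]["address"]["city"],
            "sku": item["sku"],
            "qty": item["qty"],
        })
    return rows

order = {
    "order_id": 1001,
    "customer": {"name": "Ada", "address": {"city": "Irvine"}},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

rows = flatten_order(order)
print(rows)  # two flat rows -- the nesting (and any future schema change) is lost
```

Every new nested field or schema change forces this mapping to be rewritten, which is the fragility the article describes.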
Numerous methods have been devised to bypass ETL, including the Lambda architecture, where data streams are split into separate pipelines and landed in various optimized stores for operational or analytics use cases. Another technique, dubbed hybrid transactional analytical processing (HTAP), has been put forth by the smart folks at Gartner as a more efficient way to analyze data as it sits in production systems, usually scale-out relational databases. (Increasingly, real-time stream processing systems, in-memory databases, and even in-memory data grids are being called on to solve the same elemental problem, but that’s another story.)
Couchbase’s new Analytics Service takes a page out of Gartner’s HTAP playbook, but with substantial caveats, including provisions to maintain the rich JSON data structures that customers rely upon to serve their operational computing needs. The company began researching what a native-NoSQL HTAP system would look like four years ago, when Couchbase CTO Ravi Mayuram reached out to University of California Irvine computer science professor Mike Carey about a project he was working on called AsterixDB, which Carey started about 10 years ago in response to the seismic architectural shift that was occurring.
“There was an elephant in the room, Hadoop, which was starting to happen,” Carey told Mayuram on stage at Couchbase Connect SV. “Clearly the data was a mess. It was much more complicated than rows and columns. And databases were kind of being forgotten. So people were kind of reinventing that technology without being aware of it. Those of us who were SoCal database people said, this is not the right way to go. Let’s build a parallel database for that kind of flexible data, and that was the Asterix project.”
Couchbase worked with Carey as an advisor, and used the open source AsterixDB as the foundational technology for the new database engine component of the Analytics Service, which has been in beta for the past six months (and which will not be open source). A considerable amount of work went into getting the new database engine to work with the Couchbase database. Much of it involved adapting Couchbase’s Data Change Protocol (DCP), essentially a streaming protocol that provides very fast memory-to-memory replication of data sitting in the NoSQL engine, to feed the AsterixDB-based technology that would become the Analytics Service.
“There were two big technical problems,” Carey said. “One is you were doing a lot of transactions on the front end, gathering a lot of data that had to get swallowed into the system where you could ask the big questions without perturbing the front end. So it was a data ingestion issue there. We had to look at how we could get the data in so that it moved quickly, and how do we get it at rest quickly so that you can query it.”
The solution hinged on the “data feeds” technology that Carey and his UCI researchers developed for AsterixDB, which Couchbase utilized as the basis for the log-structured merge trees, or LSM-trees, that do the work to land the data on the separate Analytics Service nodes in a way that makes them query-able.
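The LSM-tree idea at the heart of that ingestion path can be sketched in a few lines. The following is a toy model under simplifying assumptions (no write-ahead log, no compaction, single node), not the AsterixDB implementation: fast writes absorb into an in-memory table, full memtables are flushed to immutable sorted runs, and reads consult the newest data first, so the store stays queryable while ingesting.

```python
import bisect

class ToyLSMTree:
    """Toy LSM-tree sketch: in-memory writes flushed to sorted, immutable runs.
    Real engines add write-ahead logs, compaction, and bloom filters."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}        # mutable in-memory table, absorbs fast writes
        self.runs = []            # immutable sorted (key, value) runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable, sorted run on "disk".
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs newest-to-oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = ToyLSMTree()
db.put("tpms:1", {"psi": 32})
db.put("tpms:2", {"psi": 28})   # second write fills the memtable and triggers a flush
db.put("tpms:1", {"psi": 30})   # newer in-memory value shadows the flushed one
print(db.get("tpms:1"))
```

Because flushed runs are sorted and immutable, queries can binary-search them while new writes keep landing in memory, which is what makes freshly ingested data query-able so quickly.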
Next, the company needed a query language to pair with the database engine and the table-like views of JSON data materialized by the in-memory LSM-tree algorithms. The company already had N1QL, which is very similar to SQL and provides some analytic capabilities for native JSON data. But N1QL had some shortcomings that made it a poor fit for the new service – namely, it did not run in a massively parallel fashion to leverage Couchbase’s shared-nothing distributed architecture. And N1QL needed data to be pre-organized with indexes, which eliminated analysts’ ability to run ad-hoc queries to explore the data.
This is where Couchbase dipped into the academic waters a second time, and it pulled out a technology called SQL++, which is a declarative query language for JSON data that’s backwards compatible with the ANSI SQL standard. SQL++ is the subject of an ongoing research project at University of California, San Diego under computer science and engineering professor Yannis Papakonstantinou, which has been funded in part by a grant from Couchbase.
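To make the appeal concrete, here is the kind of ad-hoc question SQL++ is designed for, posed over a hypothetical connected-car data set. Both the query text and the pure-Python equivalent below are illustrations of the general SQL++ style, not code taken from Couchbase’s N1QL for Analytics:

```python
# A SQL++-style query over nested JSON might look like (illustrative only):
#   SELECT c.model, w.position
#   FROM cars AS c UNNEST c.wheels AS w
#   WHERE w.tpms.psi < 30;
# The UNNEST clause iterates a nested array without pre-flattening the data.
# Below, the same logic expressed directly over Python dicts:

cars = [
    {"model": "X1", "year": 2018,
     "wheels": [{"position": "FL", "tpms": {"psi": 32}},
                {"position": "RR", "tpms": {"psi": 27}}]},
    {"model": "Z3", "year": 2019,
     "wheels": [{"position": "FL", "tpms": {"psi": 31}}]},
]

low_pressure = [
    {"model": c["model"], "position": w["position"]}
    for c in cars
    for w in c["wheels"]          # the "UNNEST" step over the nested array
    if w["tpms"]["psi"] < 30
]
print(low_pressure)  # [{'model': 'X1', 'position': 'RR'}]
```

The point is that the nested structure is queried in place: no ETL job flattened the `wheels` array into a separate table first.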
More work went into getting SQL++ to work with Couchbase before the company could deliver the first commercially supported implementation of SQL++: N1QL for Analytics, which utilizes SQL++ under the covers. “We had to optimize it, number one to understand DCP, then bring it into the Couchbase domain,” Sachin Smotra, director of product management at Couchbase, told Datanami. SQL co-creator Don Chamberlin, an advisor to Couchbase who appeared on stage yesterday, has been involved with SQL++ and also helped the company with the implementation.
The fourth piece of the puzzle revolved around maintaining isolation from the transactional side of the Couchbase cluster. The company did some work to ensure that customers would not be subject to the “noisy neighbor” problem, whereby big SQL++ queries (via N1QL for Analytics) begin to consume resources and impact performance of the transactional cluster. The company, which recently introduced a Kubernetes operator, says customers can get separation either with dedicated hardware or containerized workloads.
When you put all this together, you get Analytics Service, which is the fourth optional component in the NoSQL data platform, along with mobile, eventing, and N1QL query. During a demo on stage, Smotra and Mayuram went through the process of “re-balancing” data from the production Couchbase cluster into an Analytics Service (in only four clicks) and then running a series of queries to discover what was occurring with a hypothetical IoT data set from a connected car system. There were anomalies detected in tire pressure monitoring systems (TPMS), and the Couchbase execs demonstrated how quickly customers could ask questions of the data, visualize the results (through a Knowi layer), and then run new queries based on the results (such as determining if there were patterns in the data, like specific models or years exhibiting the same TPMS anomalies). Such free-ranging inspection of the data comes intuitively to humans, but it’s technically challenging to pull off on machines that haven’t pre-organized the data that way.
One early Analytics Service adopter that’s impressed so far is Steven Wyant, the data architect for the Cincinnati Reds, which started using a beta copy of the service about six months ago when the Reds started their baseball season. “I didn’t necessarily know what we were going to use Couchbase for off the bat,” Wyant told Datanami. “We had a bunch of things we wanted to look at.”
The Couchbase Analytics Service allowed the Reds organization to track what tickets people bought and where they actually sat within the 42,000-seat Great American Ballpark for (almost) an entire season, which delivered better in-game understanding of what’s happening during games, as well as better insight into potential price changes for 2019. “The [Analytics Service] allows for that in a quicker manner than N1QL, where you have to kind of know operationally what you’re after,” said Wyant, who previously helped Kroger build a Hadoop cluster for Hive analysis. “That’s really why we brought it in, to look at that in a more digestible and more interactive manner than what we can do in our current infrastructure.”
In the end, Couchbase pulled off its NoETL pledge in the demo, which was impressive. But equally remarkable is how it enabled analytics on JSON data via SQL++, without modifying it. That could appeal to a lot of Couchbase customers, who today must build and maintain ETL pipelines that flatten the JSON. “This is a really significant problem to solve,” says Scott Anderson, Couchbase’s senior vice president of product management. “You have this incredibly rich data structure, nested objects and so forth. How do you derive the intelligence beyond what we do with N1QL and how do we extend the capacity, and then solve some of the problems that Ravi talked about, which is the fragility of the pipeline?”
Couchbase executives fretted that they had hidden too much of the complexity that went into Analytics Service and made it seem too simple in the demo. That led Mayuram to warn the attendees: “Don’t try this at home. Don’t think that you can solve this problem by hodge podging three different open source projects together. That’s not the case here….This is some serious stuff that we’ve built here.”