Couchbase to Deliver Parallel JSON Analytics — Without the ETL
Couchbase yesterday unveiled a JSON-native analytics engine it claims will let users perform parallel ad-hoc analytics on operational data just milliseconds after it’s landed in the NoSQL store. The Analytics Service, which is based in part on the Apache AsterixDB and SQL++ projects from academia, has been dubbed “NoETL,” but don’t let the catchy phrase fool you.
The new Analytics Service is the cornerstone of Couchbase Server 6.0, which the company launched yesterday at its Couchbase Connect SV conference at the San Jose McEnery Convention Center, and which is slated for delivery in October. The company dedicated a significant portion of the one-day show to explaining how it built the new analytics engine and what impact it will have on customers, and to running a real-time demo in which it spun up an analytics cluster in fewer than five clicks.
For many businesses, analytics is heavily dependent on the extract, transform, and load (ETL) process, whereby data is lifted out of a production system and shunted over to a data warehouse. But ETL processes can take months to define, are susceptible to shifting schemas, and typically require flattening of the rich hierarchical data structures that exist within JSON documents.
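To make the flattening step concrete, here is a minimal Python sketch of what an ETL transform typically does to a nested JSON document before it is loaded into warehouse tables. The order document and every field name in it are hypothetical, chosen only for illustration; nothing here comes from Couchbase.

```python
import json

# Hypothetical operational document; all field names are illustrative.
order = {
    "order_id": 1001,
    "customer": {"name": "Ada", "address": {"city": "San Jose", "zip": "95113"}},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}


def flatten_for_warehouse(doc):
    """Turn one nested document into flat rows, one row per array element."""
    rows = []
    for item in doc["items"]:
        rows.append({
            # Parent fields are copied into every row.
            "order_id": doc["order_id"],
            "customer_name": doc["customer"]["name"],
            "customer_city": doc["customer"]["address"]["city"],
            "customer_zip": doc["customer"]["address"]["zip"],
            # Child fields come from the nested array element.
            "sku": item["sku"],
            "qty": item["qty"],
        })
    return rows


rows = flatten_for_warehouse(order)
print(json.dumps(rows, indent=2))
```

The parent fields are repeated in every row, and the transform hard-codes the document's shape, so a new or renamed nested field means changing both the transform and the target schema. That is the sensitivity to shifting schemas described above.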
Numerous methods have been devised to bypass ETL, including the Lambda architecture, where data streams are split into separate pipelines and landed in various optimized stores for operational or analytics use cases. Another technique, dubbed hybrid transactional analytical processing (HTAP), has been put forth by the smart folks at Gartner as a more efficient way to analyze data as it sits in production systems, usually scale-out relational databases. (Increasingly, real-time stream processing systems, in-memory databases, and even in-memory data grids are being called on to solve the same elemental problem, but that’s another story.)
Couchbase’s new Analytics Service takes a page out of Gartner’s HTAP playbook, but with substantial caveats, including provisions to maintain the rich JSON data structures that customers rely upon to serve their operational computing needs. The company began researching what a native-NoSQL HTAP system would look like four years ago, when Couchbase CTO Ravi Mayuram reached out to University of California Irvine computer science professor Mike Carey about a project he was working on called AsterixDB, which Carey started about 10 years ago in response to the seismic architectural shift that was occurring.
“There was an elephant in the room, Hadoop, which was starting to happen,” Carey told Mayuram on stage at Couchbase Connect SV. “Clearly the data was a mess. It was much more complicated than rows and columns. And databases were kind of being forgotten. So people were kind of reinventing that technology without being aware of it. Those of us who were SoCal database people said, this is not the right way to go. Let’s build a parallel database for that kind of flexible data, and that was the Asterix project.”
Couchbase worked with Carey as an advisor, and used the open source AsterixDB as the foundational technology for the new database engine component of the Analytics Service, which has been in beta for the past six months (and which will not be open source). A considerable amount of work went into pairing the new database engine with the Couchbase database. Much of that work involved adapting Couchbase’s Database Change Protocol (DCP), a streaming protocol that provides very fast memory-to-memory replication of data sitting in the NoSQL engine, to work with the AsterixDB-based technology that would become Analytics Service.
“There were two big technical problems,” Carey said. “One is you were doing a lot of transactions on the front end, gathering a lot of data that had to get swallowed into the system where you could ask the big questions without perturbing the front end. So it was a data ingestion issue there. We had to look at how we could get the data in so that it moved quickly, and how do we get it at rest quickly so that you can query it.”
The solution hinged on the “data feeds” technology that Carey and his UCI researchers developed for AsterixDB, which Couchbase utilized as the basis for the log-structured merge trees, or LSM-trees, that do the work to land the data on the separate Analytics Service nodes in a way that makes them query-able.
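For readers unfamiliar with LSM-trees, the following toy Python sketch shows the general idea: writes land in a small in-memory buffer, the buffer is periodically flushed as an immutable sorted run, and runs are later merged. This is a generic illustration only. It is not how AsterixDB or the Analytics Service implements its storage, and the class name, the `memtable_limit` parameter, and the sample keys are all hypothetical.

```python
class TinyLSM:
    """Toy LSM-tree: a mutable in-memory buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}                  # most recent writes
        self.runs = []                      # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the buffer out as one sorted, immutable run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the newest data first: buffer, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one; entries from newer runs overwrite older ones.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("car:1", {"psi": 32})
db.put("car:2", {"psi": 30})   # buffer is full, so it is flushed to a run
db.put("car:1", {"psi": 28})   # newer version of car:1
db.put("car:3", {"psi": 31})   # second flush
db.compact()
print(db.get("car:1"))
```

Because incoming writes are only appended to the buffer and flushed sequentially, ingestion stays fast, while the sorted runs give queries something ordered to read. A production LSM-tree would add on-disk storage, binary search or Bloom filters for lookups, and background compaction.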
Next, the company needed a query language to pair with the database engine and the table-like views of JSON data materialized by the in-memory LSM-tree algorithms. The company already had N1QL, which is very similar to SQL and provides some analytic capabilities for native JSON data. But N1QL had some shortcomings that made it a poor fit for the new service – namely, it did not run in a massively parallel fashion to leverage Couchbase’s shared-nothing distributed architecture. And N1QL needed data to be pre-organized with indexes, which eliminated analysts’ ability to run ad-hoc queries to explore the data.
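The difference between an indexed lookup and a massively parallel ad-hoc scan can be sketched in a few lines of Python. In this hypothetical example, each list in `partitions` stands in for the data held by one node of a shared-nothing cluster, and threads stand in for the nodes. Each "node" scans only its own data and returns a partial aggregate, and a coordinator merges the partials. The data, function names, and filter are invented for illustration and do not reflect how Couchbase executes queries.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions, one per "node"; no index exists on any field.
partitions = [
    [{"city": "Irvine", "total": 40}, {"city": "San Diego", "total": 15}],
    [{"city": "Irvine", "total": 75}, {"city": "San Jose", "total": 10}],
    [{"city": "San Diego", "total": 90}],
]


def local_count(partition, min_total=30):
    """Scan one partition and count qualifying documents per city."""
    return Counter(d["city"] for d in partition if d["total"] >= min_total)


def parallel_count(parts):
    """Run the local scans concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(local_count, parts))
    merged = Counter()
    for partial in partials:
        merged.update(partial)   # Counter.update adds counts together
    return dict(merged)


result = parallel_count(partitions)
print(result)
```

Since every partition is scanned in full, the filter can be on any field the analyst thinks of, which is what makes the query ad hoc. The cost is a full scan, which is why spreading it across many nodes matters.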
This is where Couchbase dipped into the academic waters a second time, and it pulled out a technology called SQL++, which is a declarative query language for JSON data that’s backwards compatible with the ANSI SQL standard. SQL++ is the subject of an ongoing research project at University of California, San Diego under computer science and engineering professor Yannis Papakonstantinou, which has been funded in part by a grant from Couchbase.
More work was needed to get SQL++ working with Couchbase before the company could deliver the first commercially supported implementation of SQL++, which it is doing with N1QL for Analytics, a language that uses SQL++ under the covers. “We had to optimize it, number one to understand DCP, then bring it into the Couchbase domain,” Sachin Smotra, director of product management at Couchbase, told Datanami. SQL co-creator Don Chamberlin, who is an advisor to Couchbase and appeared on stage yesterday, has been involved with SQL++ and also helped the company with the implementation.
The fourth piece of the puzzle revolved around maintaining isolation from the transactional side of the Couchbase cluster. The company did some work to ensure that customers would not be subject to the “noisy neighbor” problem, whereby big SQL++ queries (via N1QL for Analytics) begin to consume resources and impact performance of the transactional cluster. The company, which recently introduced a Kubernetes operator, says customers can get separation either with dedicated hardware or containerized workloads.
When you put all this together, you get Analytics Service, which is the fourth optional component in the NoSQL data platform, along with mobile, eventing, and N1QL query. During a demo on stage, Smotra and Mayuram went through the process of “re-balancing” data from the production Couchbase cluster into an Analytics Service (in only four clicks) and then running a series of queries to discover what was occurring with a hypothetical IoT data set from a connected car system. Anomalies were detected in the tire pressure monitoring systems (TPMS), and the Couchbase execs demonstrated how quickly customers could ask questions of the data, visualize the results (through a Knowi layer), and then run new queries based on the results (such as determining if there were patterns in the data, like specific models or years exhibiting the same TPMS anomalies). Such free-ranging inspection of the data comes intuitively for humans, but it’s actually a really challenging technical thing to pull off with machines that haven’t organized the data that way.
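The kind of drill-down shown in the demo can be approximated in plain Python to show why keeping the nested structure helps. The sketch below is not the demo's code or data: the documents, field names, and the `LOW_PSI` threshold are all assumptions made for illustration. It asks a first question (which cars show an anomaly?) and then a follow-up (do anomalies cluster by model and year?) directly against the nested documents.

```python
from collections import defaultdict

# Hypothetical connected-car documents; field names and values are invented.
cars = [
    {"vin": "V1", "model": "Alpha", "year": 2016,
     "tpms": [{"wheel": "FL", "psi": 21}, {"wheel": "FR", "psi": 33}]},
    {"vin": "V2", "model": "Alpha", "year": 2016,
     "tpms": [{"wheel": "RL", "psi": 19}]},
    {"vin": "V3", "model": "Beta", "year": 2018,
     "tpms": [{"wheel": "FL", "psi": 34}]},
]

LOW_PSI = 25  # assumed anomaly threshold, chosen only for this example


def anomalous_vins(docs):
    """First question: which cars have any low-pressure reading?"""
    return [d["vin"] for d in docs
            if any(r["psi"] < LOW_PSI for r in d["tpms"])]


def anomalies_by_model_year(docs):
    """Follow-up: count low-pressure readings per (model, year)."""
    counts = defaultdict(int)
    for d in docs:
        for reading in d["tpms"]:          # walk the nested array in place
            if reading["psi"] < LOW_PSI:
                counts[(d["model"], d["year"])] += 1
    return dict(counts)


print(anomalous_vins(cars))
print(anomalies_by_model_year(cars))
```

Neither question required the readings to be split out into a separate table or indexed ahead of time. The second query was written only after seeing the result of the first, which is the exploratory pattern described above.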
One early Analytics Service adopter that’s impressed so far is Steven Wyant, the data architect for the Cincinnati Reds, which started using a beta copy of the service about six months ago when the Reds started their baseball season. “I didn’t necessarily know what we were going to use Couchbase for off the bat,” Wyant told Datanami. “We had a bunch of things we wanted to look at.”
The Couchbase Analytics Service allowed the Reds organization to track what tickets people bought and where they actually sat within the 42,000-seat Great American Ballpark for (almost) an entire season, which delivered better in-game understanding of what’s happening during games, as well as better insight into potential price changes for 2019. “The [Analytics Service] allows for that in a quicker manner than N1QL, where you have to kind of know operationally what you’re after,” said Wyant, who previously helped Kroger build a Hadoop cluster for Hive analysis. “That’s really why we brought it in, to look at that in a more digestible and more interactive manner than what we can do in our current infrastructure.”
In the end, Couchbase pulled off its NoETL pledge in the demo, which was impressive. But equally remarkable is how it enabled analytics on JSON data via SQL++, without modifying it. That could appeal to a lot of Couchbase customers, who today must build and maintain ETL pipelines that flatten the JSON. “This is a really significant problem to solve,” says Scott Anderson, Couchbase’s senior vice president of product management. “You have this incredibly rich data structure, nested objects and so forth. How do you derive the intelligence beyond what we do with N1QL and how do we extend the capacity, and then solve some of the problems that Ravi talked about, which is the fragility of the pipeline?”
Couchbase executives fretted that they had hidden too much of the complexity that went into Analytics Service and made it seem too simple in the demo. That led Mayuram to warn the attendees: “Don’t try this at home. Don’t think that you can solve this problem by hodge podging three different open source projects together. That’s not the case here… This is some serious stuff that we’ve built here.”