June 3, 2019

Cassandra, Kafka Help Scale Anomaly Detection

The scaling of open-source platforms continues apace with the demonstration of an anomaly detection application capable of processing tens of billions of events per day using Apache Cassandra and Apache Kafka running on the Kubernetes container orchestrator.

Instaclustr, the “open source-as-a-service” startup, reported this week that its anomaly detection data pipeline processed and vetted 19 billion real-time events in a single day. The demonstration illustrates how far applications like anomaly detection can scale when Cassandra and Kafka are fine-tuned, the startup said.
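To put the daily figure in perspective, the short calculation below converts 19 billion events per day into an average per-second rate. It assumes events arrive evenly over 24 hours, which the announcement does not state; real traffic is likely to be burstier.

```python
# Average throughput implied by 19 billion events in one day.
# Assumption: events are spread evenly across the full 24 hours.
EVENTS_PER_DAY = 19_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{events_per_second:,.0f} events per second")  # roughly 220,000
```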

Anomaly detection spots unusual events in streaming data, which frequently indicate security threats or suspicious activity. It is used in applications such as financial fraud detection, security threat detection, website user analytics, sensors and Internet of Things deployments. The application works by comparing vetted streaming data against historical event patterns, “raising alerts if those patterns match previously recognized anomalies or show significant deviations from normal behavior,” the startup said.
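The idea of flagging “significant deviations from normal behavior” can be sketched in a few lines. The example below uses a simple z-score test against a sliding window of past values for each key. This is an illustrative choice only: the article does not say which statistical test Instaclustr used, and the class name, window size and threshold are all assumptions.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class ZScoreDetector:
    """Hypothetical detector: flags a value that lies far from recent history."""

    def __init__(self, window=50, threshold=3.0, min_history=10):
        self.threshold = threshold      # z-score above which a value is flagged
        self.min_history = min_history  # do not judge until enough history exists
        # One bounded window of past values per key (e.g. per account or sensor).
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, key, value):
        """Return True if value is anomalous for key, then record it."""
        past = self.history[key]
        anomalous = False
        if len(past) >= self.min_history:
            mu = mean(past)
            sigma = pstdev(past)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # History is perfectly flat, so any change counts as a deviation.
                anomalous = value != mu
        past.append(value)
        return anomalous


detector = ZScoreDetector()
# Twenty ordinary readings that alternate between 10 and 11.
normal_flags = [detector.observe("sensor-1", 10 + (i % 2)) for i in range(20)]
# A reading far outside the historical range.
spike_flag = detector.observe("sensor-1", 100)
```

Here every entry in `normal_flags` is False, while `spike_flag` is True. A production system would also need to match events against previously recognized anomaly patterns, which this sketch leaves out.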

Instaclustr, Redwood City, Calif., said it combined Cassandra, Kafka and the anomaly detection application in a Lambda architecture, with Kafka acting as the “speed layer” and Cassandra as the batch and serving layer. Amazon Web Services’ (NASDAQ: AMZN) managed Kubernetes service was used to automate application provisioning, deployment and scaling.
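The division of labor between the two layers can be illustrated with in-memory stand-ins. In the sketch below a `deque` plays the role of the Kafka speed layer and a small `HistoryStore` class plays the role of the Cassandra batch and serving layer. Neither uses the real Kafka or Cassandra client libraries, and the class, function names and deviation rule are hypothetical, chosen only to show how events flow through the pipeline.

```python
from collections import defaultdict, deque


class HistoryStore:
    """In-memory stand-in for the Cassandra batch/serving layer (hypothetical)."""

    def __init__(self):
        self._rows = defaultdict(list)

    def write(self, key, value):
        self._rows[key].append(value)

    def recent(self, key, limit):
        return self._rows[key][-limit:]


def deviates(value, history, max_ratio=2.0):
    """Placeholder rule: value is further from the historical mean than
    max_ratio times that mean. Not the test Instaclustr used."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return abs(value - baseline) > max_ratio * max(abs(baseline), 1e-9)


def run_pipeline(stream, store, window=50):
    """Drain the stream, check each event against stored history, collect alerts."""
    alerts = []
    while stream:
        key, value = stream.popleft()        # consume from the speed layer
        history = store.recent(key, window)  # read history from the serving layer
        if deviates(value, history):
            alerts.append((key, value))
        store.write(key, value)              # persist the event for later checks
    return alerts


# A deque stands in for a Kafka topic in this sketch.
stream = deque([("sensor-1", 10.0)] * 5 + [("sensor-1", 100.0)])
alerts = run_pipeline(stream, HistoryStore())
```

After the run, `alerts` contains only the final event, `("sensor-1", 100.0)`. In a real deployment the consumer loop and the store would run as separate, horizontally scaled services, which is where an orchestrator such as Kubernetes comes in.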

Detection system stacks often include machine learning, statistical analysis and algorithm optimization. They rely on data-layer technologies to ingest, process, analyze, disseminate and store streaming data.

The architecture must be capable of handling daily volumes running into the billions of events. “In these scenarios, data-layer technologies must overcome substantial computational, performance and scalability requirements in order to cope with the massive scale of events,” the company noted.

Instaclustr has published a white paper explaining its open-source data layer approach. The open-source code is available on GitHub.

Instaclustr is among a growing list of companies focused on delivering clustered open-source software as cloud-hosted services. Anodot, which uses machine learning techniques to spot anomalies in time-series data, unveiled an application last fall built on Cassandra, Kafka and other open-source tools.

Cassandra was used as the real-time data store, while Kafka fed data into Anodot’s autonomous analytics framework.

Recent items:

Anodot Gains Patents for Anomaly Detection

Open Source is Now a Big Data Service

Elastic Adds ‘One Click’ Anomaly Detection to Stack