March 4, 2020

Real-Time Data Streaming, Kafka and Analytics Part 3: Effective Planning for Data Streaming Improves Data Analytics

Thornton Craig, Dan Potter, and Tim Berglund

Data stream processing refers to performing transformations and analytics on data while it is still moving through a stream. In Part 1 of this series, we defined data streaming to provide an understanding of its importance. In Part 2, we got a bit more technical, explaining data integration and ingestion into one of the most popular streaming platforms, Apache Kafka. This final piece explores the benefits of data stream processing with Kafka, as well as how best to plan for implementing data streams.

Companies that wish to take advantage of real-time data streams for analytics require a modern data architecture. This infrastructure can be managed through the emerging discipline of DataOps, which applies the principles of lean manufacturing, DevOps, and agile software development to data pipeline management. By leveraging DataOps, companies can implement a fully governed data management strategy that promotes team collaboration and the use of data for business-driven analytics. A direct benefit is that as individuals work together on data analysis, they gain a deeper understanding of the information and raise their own data literacy.

Key technologies behind a DataOps strategy are the same solutions we discussed in Part 2: Change Data Capture (CDC) and Apache Kafka. Real-time data streaming with Apache Kafka is highly efficient because it enables integration and in-stream analytics on data as it moves through the stream. CDC is widely seen as the optimal mechanism for capturing transactional data and delivering it into a streaming platform. Once the data lands in Kafka, both the data and any analysis performed on it can be shared with other components and consumers.
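To make the CDC-to-Kafka handoff concrete, here is a minimal sketch, assuming a hypothetical "orders.changes" topic and a hand-written JSON change event, of how a captured database change might be published to Kafka with the plain Java producer client. In practice, a CDC tool generates and delivers these events automatically from the source database's transaction log.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CdcEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Illustrative change event, roughly as a CDC tool might capture it from
        // the source database's transaction log (the structure is an assumption).
        String changeEvent = "{\"op\":\"UPDATE\",\"table\":\"orders\","
                + "\"before\":{\"id\":42,\"status\":\"PENDING\"},"
                + "\"after\":{\"id\":42,\"status\":\"SHIPPED\"}}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by the row's primary key keeps every change to that row on
            // the same partition, so downstream consumers see changes in order.
            producer.send(new ProducerRecord<>("orders.changes", "42", changeEvent));
        }
    }
}
```

Once events like this land on a topic, any number of consumers, Streams applications, or KSQL queries can read and enrich them independently.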

Benefits of Using Apache Kafka

Apache Kafka is a preferred method for streaming due to three core components. The Streams API supports building applications or microservices that read from and write back to Kafka. KSQL, a SQL-like language, is used to describe stream processing operations on data in Kafka. And finally, windowing applies time-based constraints to determine which subset of records is being viewed. The combination of these three components creates a unique framework that provides many benefits (illustrated in the code sketch after the list below), including:

  • Exactly-once: Through the Streams API, Kafka ensures that each message is processed exactly once, even in the presence of retries or duplicates, helping meet efficiency and latency requirements.
  • Stateful and stateless: Apache Kafka can process each record on its own, without referencing other messages (stateless), or aggregate records based on the history of the information in the stream (stateful).
  • Time: The Streams API supports windowing and other time-based joins and aggregations to create greater data availability and efficiency.
  • CDC: This technology uses log-based extraction to capture changes from source systems, creating new streams on which in-stream analytics can be performed.
  • Real-time: Processing happens on live data in real-time windows, one record at a time. By eliminating batch processing, companies are assured of having the most recent information for data analytics.
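As a rough illustration of the Streams API, windowing, and the exactly-once setting described above, here is a minimal Kafka Streams sketch, assuming Kafka 3.x and a hypothetical "orders" topic keyed by customer ID, that counts each customer's orders in five-minute tumbling windows:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class WindowedOrderCounts {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-order-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Exactly-once processing is a single configuration switch in the Streams API.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();

        // Read the hypothetical "orders" topic (key = customer ID, value = raw order record).
        KStream<String, String> orders = builder.stream(
                "orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Stateful, time-based aggregation: count orders per customer in 5-minute tumbling windows.
        orders.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
              .count()
              .toStream()
              .foreach((Windowed<String> key, Long count) ->
                      System.out.printf("window=%s customer=%s orders=%d%n",
                              key.window(), key.key(), count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The same windowed aggregation can also be expressed declaratively as a KSQL query rather than Java code, which is often the quicker path for analysts.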

Stream Processing with CDC and Kafka

Implementing a new data architecture is not something that can happen overnight. Rather, companies succeed through effectively planning, designing and building these infrastructure changes. Without planning to implement efficient and scalable processes, chaos will ensue and data integrity will be lost.

When starting out, companies should look to implement five high-level attributes that span the DataOps methodology of people, processes and technology.

These include:

  1. Effective data pipeline design: When designing pipelines, companies should catch and fix processing failures early, before they compound. Additionally, checking for data integrity and publishing transactional inserts, updates and deletes, as well as source schema changes, will ensure that data is effectively integrated and replicated.
  2. Standardized process: It's critical to create clear guidance on how to operate, covering planning, design, data quality, testing and production. A standard process will ensure that data teams meet business requirements and complete data analysis and visualizations on time.
  3. Shared resources: Each business department will have different use cases, resource requirements and usage patterns. Supporting them with pooled resources, potentially cloud-based, can be more economical and improve infrastructure utilization.
  4. Monitoring: Companies must monitor key metrics, including state, memory, throughput, latency, partition counts and consumer lag on topics, to ensure they meet business SLAs (see the lag-checking sketch after this list).
  5. Data governance: Compliance has several aspects, such as ensuring data producers have accountable parties responsible for accuracy, along with profiling tools, quality checks and more. Companies will need to ensure that regulations such as GDPR are followed and that all data use is reported to the appropriate internal compliance officer.
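For the monitoring attribute above, one lightweight way to watch consumer lag is Kafka's AdminClient. The sketch below, assuming a hypothetical consumer group named "analytics-app", compares the group's committed offsets with the latest offsets on each partition; dedicated monitoring tooling surfaces the same metrics with less effort.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the (hypothetical) "analytics-app" group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("analytics-app")
                         .partitionsToOffsetAndMetadata()
                         .get();

            // Latest offsets actually written to those partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                                 .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all()
                         .get();

            // Lag = how far each partition's committed offset trails the head of the log.
            committed.forEach((tp, offset) -> {
                long lag = latest.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```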

Importance of Data Stream Processing

Throughout this article series, we examined the importance of real-time data streaming and processing in modern data architectures. It’s clear that stream processing enables a number of information initiatives including real-time analytics, microservices integration, log analysis and data integration.

Firms looking to fully leverage data stores for better business decision making need to implement a modern data framework, perhaps following the DataOps methodology, which should include CDC and Apache Kafka technologies. It’s through these strategies that firms will enable individuals to strengthen their data literacy and data-driven collaboration.

If you are interested in more details on transaction data streaming, along with suggestions on how to best implement Apache Kafka and create organized data streams with CDC, there is a free Dummies book, Apache Kafka Transaction Data Streaming for Dummies, that provides greater detail.

About the authors: Thornton Craig, a senior technical manager with Amazon Web Services, has spent more than 20 years in the industry and previously served as research director at Gartner. Dan Potter, the vice president of product marketing at Qlik (formerly Attunity), has 30 years of experience in the field and is currently responsible for product marketing and go-to-market strategies related to modern data architectures, data integration, and DataOps. Tim Berglund, the senior director of developer experience at Confluent, is a teacher, author, and technology leader.

Related Items:

Real-Time Data Streaming, Kafka, and Analytics Part 2: Going Beyond Pure Streaming

Real-Time Data Streaming, Kafka, and Analytics Part One: Data Streaming 101
