January 18, 2019

5 Key Data Integration Considerations for Apache Kudu

Irem Radzik

Apache Kudu was designed specifically for use cases that require low-latency analytics on rapidly changing data, including time-series, machine data, and data warehousing. Its architecture provides for rapid inserts and updates coupled with column-based queries, enabling real-time analytics using a single scalable distributed storage layer.

However, the analytics are only as timely and relevant as the data behind them. To get the most out of Kudu's speed, you need to deliver data to it in real time, as soon as possible after the data is created.

Here are the top 5 considerations when choosing a data integration solution for Kudu.

1. Does it support the data sources you need for the core use cases with real-time, continuous ingestion capabilities?

To provide holistic analysis of enterprise data, it is crucial to ingest all relevant information, regardless of the source. If there are gaps in the data sources or data types supported by your integration software, you may be limited in the content that can be fed into Kudu. This can lead to partial or even misleading insight from your reports or analytics applications.

Having a real-time ETL solution that can ingest a wide range of dynamically changing data is a critical step in gaining value from your fast analytics platform. Whether the data is coming from databases, machine logs, applications, or IoT devices, it needs to be collected in real time, within microseconds to milliseconds of its creation. This means utilizing the right technology and techniques to achieve very low-latency continuous data collection independent of the source.

2. Does it support your existing schemas, especially for the reports that you want to keep working while allowing for new use cases without extensive coding?

Real-time ETL is not only about ingestion-related performance and the resulting low latency, but also about processing speed and flexibility. A solution that can efficiently perform a wide range of transformations, including filtering, aggregation, masking, and enrichment, to prepare the data for your existing schemas will enable you to continue supporting your current end users while adding new applications.

Especially for high-velocity and sensitive data, these data preparation steps need to take place before the data is delivered to your analytics environment so you can avoid introducing latency, optimize storage in Kudu, reduce on-disk processing workload, and fully comply with government regulations.
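
As a rough illustration of preparing data while it is in motion, the Python sketch below chains filtering, masking, and enrichment as lazy generator stages, so each event is processed as it arrives rather than after landing in storage. The field names (`device_id`, `email`, `reading`) and the `DEVICE_REGION` lookup are hypothetical, and the sketch is not tied to any particular integration product.

```python
import hashlib
from typing import Dict, Iterable, Iterator

# Hypothetical reference data used for enrichment (not from the article).
DEVICE_REGION = {"dev-1": "us-west", "dev-2": "eu-central"}


def keep_valid(events: Iterable[Dict]) -> Iterator[Dict]:
    """Filter: drop events with no reading so they never reach storage."""
    for event in events:
        if event.get("reading") is not None:
            yield event


def mask_email(events: Iterable[Dict]) -> Iterator[Dict]:
    """Mask: replace the raw email with a truncated one-way SHA-256 digest."""
    for event in events:
        digest = hashlib.sha256(event["email"].encode("utf-8")).hexdigest()
        yield {**event, "email": digest[:16]}


def add_region(events: Iterable[Dict]) -> Iterator[Dict]:
    """Enrich: attach a region looked up from in-memory reference data."""
    for event in events:
        region = DEVICE_REGION.get(event["device_id"], "unknown")
        yield {**event, "region": region}


def prepare(events: Iterable[Dict]) -> Iterator[Dict]:
    """Chain the stages lazily; nothing is buffered between them."""
    return add_region(mask_email(keep_valid(events)))


raw = [
    {"device_id": "dev-1", "email": "a@example.com", "reading": 21.5},
    {"device_id": "dev-2", "email": "b@example.com", "reading": None},
    {"device_id": "dev-9", "email": "c@example.com", "reading": 19.0},
]
prepared = list(prepare(raw))
```

Here the event with a missing reading is discarded before delivery, and the remaining rows carry a masked email and a region, so the target never stores the raw sensitive value.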

Not all use cases need millisecond speed, but once latency goes above a minute, many use cases lose much of their chance of yielding actionable insights from high-velocity data. Kudu is designed for very low-latency use cases. Data is available for querying only after it has been processed and written to Kudu, so the ETL process should add as little latency as possible between creation and delivery to Kudu.

When real-time ETL for Kudu can perform the processing in-memory, while the data is in motion, it can scale to handle large volumes without introducing latency and will accelerate the time to insight. It also simplifies the overall data architecture, enabling end-to-end recoverability and full resiliency.

3. Is it secure and reliable in processing, including enabling exactly once processing (E1P) and delivery?

Security is always a key consideration and becomes more problematic if the real-time ETL involves multiple products. Enabling end-to-end security across various components can require a lot of effort to reliably meet your strict data security requirements. Choosing a real-time ETL solution with built-in end-to-end security saves you from significant risks and costs.

The same applies to reliability. Can you trust the insight from your fast analytics if the processing or delivery contains duplicates or misses data? Mission-critical use cases require full trust in the data you put in.

As we all know: Garbage in, garbage out. It becomes much more difficult to guarantee E1P if there are time windows involved, for example in aggregations. That’s why choosing a platform that automatically recovers after an outage without manual intervention will save you development and maintenance costs, as well as prevent inaccurate conclusions and actions.
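
One common way to approximate exactly-once delivery is to combine replay after failure with idempotent, keyed writes and a checkpoint of the last applied source position. The Python sketch below shows that idea with an in-memory stand-in for the target table; the `IdempotentSink` class and the event layout are assumptions made for illustration, not a description of how any specific platform implements E1P. Kudu tables have primary keys and support upserts, which is what makes a keyed write of this kind plausible for a Kudu target.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event layout: (source offset, primary key, value)
Event = Tuple[int, str, float]


class IdempotentSink:
    """In-memory stand-in for a target table keyed by primary key."""

    def __init__(self) -> None:
        self.rows: Dict[str, float] = {}
        self.checkpoint = -1  # highest source offset already applied

    def deliver(self, events: Iterable[Event]) -> int:
        """Apply events in offset order, skipping any already applied."""
        applied = 0
        for offset, key, value in events:
            if offset <= self.checkpoint:
                continue  # replayed after a restart, so ignore it
            self.rows[key] = value    # upsert: insert or overwrite by key
            self.checkpoint = offset  # a real system would persist this durably
            applied += 1
        return applied


stream = [(0, "sensor-a", 1.0), (1, "sensor-b", 2.0), (2, "sensor-a", 3.0)]
sink = IdempotentSink()
first = sink.deliver(stream[:2])  # an outage occurs after offset 1
second = sink.deliver(stream)     # recovery replays the stream from the start
```

After the replay, only the one event that had not yet been applied is written, so the table holds no duplicates and no gaps. Windowed aggregations are harder because the partial window state must be checkpointed along with the offset.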

4. Is the coding language accessible to all the groups that need to be involved?

The longevity of any solution is also dependent on how widely it is used and accepted within the organization. When a data integration tool is easy to understand and easy to use for different groups, especially for the business teams, its adoption will likely be faster.

While Java, Scala, and other programming languages are popular for analytics solutions, using a SQL-based language to process the data opens the solution to a much larger set of users. It also reduces the stress of maintaining harder-to-find skill sets to support the solution. In addition, an intuitive UI for end users, alongside command-line options for power users, will increase development productivity for this broad user group.
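
To give a sense of why SQL lowers the barrier, the sketch below expresses a filter and an aggregation as a single declarative query. It uses Python's built-in `sqlite3` module purely as a stand-in for a SQL-based processing language; the `readings` table and its columns are hypothetical, and a streaming engine would apply a similar query continuously rather than to a static table.

```python
import sqlite3

# In-memory database used only to demonstrate the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("dev-1", 20.0), ("dev-1", 22.0), ("dev-2", 18.0), ("dev-2", None)],
)

# Filter out missing readings, then aggregate per device.
rows = conn.execute(
    """
    SELECT device_id, AVG(reading) AS avg_reading, COUNT(*) AS n
    FROM readings
    WHERE reading IS NOT NULL
    GROUP BY device_id
    ORDER BY device_id
    """
).fetchall()
conn.close()
```

The same logic written imperatively would need explicit loops, grouping structures, and null handling, which is the kind of code that business teams are less likely to read or maintain.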

5. Does it provide you the flexibility to move the data to new targets, especially in the cloud?

Technology requirements change. While you might have a specific endpoint like Apache Kudu in mind today, new requirements may dictate other technologies in the future.

The flexibility to supply raw or processed data to other on-premises and cloud targets simplifies and future-proofs your data architecture. Furthermore, for many data sources, especially database change data capture, reading the same data multiple times can add unacceptable overhead to source systems.

When looking at real-time ETL solutions, you should consider whether data read from a single source can be delivered to multiple targets simultaneously. These targets could include Kudu, as well as Kafka for real-time data distribution, and cloud technologies for elastic scalability.
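
The read-once, deliver-to-many pattern can be sketched in a few lines of Python. In the example below, `CountingSource` is a hypothetical source that counts how often each event is read, and plain lists stand in for the Kudu, Kafka, and cloud targets; real connectors would replace the `append` calls.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]
Sink = Callable[[Event], None]


class CountingSource:
    """Hypothetical source that records how many events have been read."""

    def __init__(self, events: Iterable[Event]) -> None:
        self._events = list(events)
        self.reads = 0

    def __iter__(self) -> Iterator[Event]:
        for event in self._events:
            self.reads += 1
            yield event


def fan_out(source: Iterable[Event], sinks: List[Sink]) -> int:
    """Read each event once and hand it to every sink."""
    delivered = 0
    for event in source:
        for sink in sinks:
            sink(event)
        delivered += 1
    return delivered


# Lists standing in for three delivery targets.
kudu_rows: List[Event] = []
kafka_msgs: List[Event] = []
cloud_objects: List[Event] = []

source = CountingSource([{"id": 1}, {"id": 2}, {"id": 3}])
count = fan_out(source, [kudu_rows.append, kafka_msgs.append, cloud_objects.append])
```

The source is touched once per event regardless of how many targets are attached, which is the property that matters when the source is a production database being read through change data capture.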

When you feed pre-processed data from your high-velocity data sources to Kudu in real-time using a secure, reliable, and easy-to-use solution, you can gain the maximum benefit from your fast analytics applications with the least amount of effort.

About the author: Irem Radzik leads product marketing at Striim. Before working for Striim, Irem was the Director of Product Marketing for Oracle Cloud Integration product group. Irem has more than 18 years of product management and marketing experience in enterprise software, financial services and consulting industries—with a focus on data and application integration, and business analytics technologies. She joined Oracle with the acquisition of GoldenGate Software in 2009. Before GoldenGate Software, Irem worked at Siebel Systems (now part of Oracle), TIBCO Software and Enkata Technologies. She holds an M.B.A. degree from the University of Pennsylvania, Wharton School of Business.
