January 18, 2019

5 Key Data Integration Considerations for Apache Kudu

Irem Radzik

Apache Kudu was designed specifically for use cases that require low-latency analytics on rapidly changing data, including time-series, machine data, and data warehousing. Its architecture provides for rapid inserts and updates coupled with column-based queries, enabling real-time analytics using a single scalable distributed storage layer.

However, analytics can only be as timely and relevant as the data behind them. To get the most out of Kudu's speed, you need to deliver data to it in real time, as soon as possible after the data is originally created.

Here are the top 5 considerations when choosing a data integration solution for Kudu.

1. Does it support the data sources you need for the core use cases with real-time, continuous ingestion capabilities?

To provide holistic analysis of enterprise data, it is crucial to be able to ingest all relevant information, regardless of the source. If there are gaps in the data sources or data types supported by your integration software, you may be limited in the content that can be fed into Kudu. This can lead to partial, or even misleading, insight from your reports or analytics applications.

Having a real-time ETL solution that can ingest a wide range of dynamically changing data is a critical step in gaining value from your fast analytics platform. Whether the data is coming from databases, machine logs, applications, or IoT devices, it needs to be collected in real time, within microseconds to milliseconds of its creation. This means utilizing the right technology and techniques to achieve very low-latency continuous data collection independent of the source.

2. Does it support your existing schemas, especially for the reports that you want to keep working while allowing for new use cases without extensive coding?

Real-time ETL is not only about ingestion performance and the resulting low latency, but also about processing speed and flexibility. A solution that can efficiently perform a wide range of transformations, filtering, aggregation, masking, and enrichment to prepare the data for your existing schemas will enable you to continue to support your current end users while possibly adding new applications.

Especially for high-velocity and sensitive data, these data preparation steps need to take place before the data is delivered to your analytics environment so you can avoid introducing latency, optimize storage in Kudu, reduce on-disk processing workload, and fully comply with government regulations.
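As a rough illustration of preparing data before delivery, the sketch below filters, masks, and enriches events one at a time in memory. It is plain Python and does not use any Kudu or vendor API. The field names (`device_id`, `user_email`, `temperature`), the `region_by_device` lookup, and the helper names are all hypothetical, and the unsalted hash stands in for whatever masking scheme your compliance rules actually require.

```python
import hashlib


def mask_email(email: str) -> str:
    """Replace an email address with a stable token (illustrative only;
    a real deployment would likely need a keyed or salted scheme)."""
    digest = hashlib.sha256(email.lower().encode("utf-8")).hexdigest()
    return digest[:16]


def prepare(events, region_by_device):
    """Filter, mask, and enrich events while they are in motion.

    Yields rows shaped for a hypothetical target schema, so nothing
    sensitive or unusable is ever written to the analytics store.
    """
    for event in events:
        # Filter: drop readings that carry no measurement.
        if event.get("temperature") is None:
            continue
        yield {
            "device_id": event["device_id"],
            # Mask: the raw email never leaves this function.
            "user_token": mask_email(event["user_email"]),
            "temperature": event["temperature"],
            # Enrich: add reference data from an in-memory lookup.
            "region": region_by_device.get(event["device_id"], "unknown"),
        }
```

Because `prepare` is a generator, each row can be handed to the delivery step as soon as it is produced, rather than after a batch has been staged on disk.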

Not all use cases need millisecond speed, but once latency goes above a minute, in many use cases you reduce the chances of getting actionable insights from your high-velocity data. Kudu is designed for very low-latency use cases. Data is only available for querying after it has been processed and written to Kudu, so the ETL process should add as little latency as possible between creation and delivery to Kudu.

When real-time ETL for Kudu can perform the processing in-memory, while the data is in motion, it can scale to handle large volumes without introducing latency and will accelerate the time to insight. It also simplifies the overall data architecture, enabling end-to-end recoverability and full resiliency.

3. Is it secure and reliable in processing, including enabling exactly once processing (E1P) and delivery?

Security is always a key consideration and becomes more problematic if the real-time ETL involves multiple products. Enabling end-to-end security across various components can require a lot of effort to reliably meet your strict data security requirements. Choosing a real-time ETL solution with built-in end-to-end security saves you from significant risks and costs.

The same applies to reliability. Can you trust the insight from your fast analytics if the processing or delivery contains duplicates or misses data? Mission-critical use cases require full trust in the data you put in.

As we all know: garbage in, garbage out. It becomes much more difficult to guarantee E1P if there are time windows involved, for example in aggregations, because an outage can strike in the middle of a window. That’s why choosing a platform that automatically recovers after an outage without manual intervention will save you development and maintenance costs, as well as prevent inaccurate conclusions and actions.
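One common way to get exactly-once results for windowed aggregations is to checkpoint the window state together with the source offset and to write results as upserts keyed by the window. The sketch below assumes that approach; it is not a description of how any particular product implements E1P. `UpsertSink`, `run`, the 60-second window, and the checkpoint layout are hypothetical, and the sink is an in-memory stand-in for a table with a primary key.

```python
import copy

WINDOW_SECONDS = 60  # assumed tumbling-window size


class UpsertSink:
    """Stand-in for a keyed table: a write with an existing key overwrites."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        self.rows[key] = value


def run(events, sink, checkpoint, crash_at=None):
    """Count events per (window, sensor), resuming from the checkpoint.

    `events` is a replayable list of (timestamp_seconds, sensor) tuples.
    `crash_at` simulates an outage just before that offset is processed.
    """
    counts = copy.deepcopy(checkpoint["counts"])
    for offset in range(checkpoint["offset"], len(events)):
        if offset == crash_at:
            raise RuntimeError("simulated outage")
        ts, sensor = events[offset]
        key = (ts - ts % WINDOW_SECONDS, sensor)
        counts[key] = counts.get(key, 0) + 1
        # Idempotent write: replaying an event rewrites the same value.
        sink.upsert(key, counts[key])
        # Save window state and offset together every two events.
        if (offset + 1) % 2 == 0:
            checkpoint["counts"] = copy.deepcopy(counts)
            checkpoint["offset"] = offset + 1
```

If a run fails partway through a window, calling `run` again with the same `checkpoint` replays only the events after the last saved offset, and the upserts overwrite any partial results instead of double counting them.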

4. Is the coding language accessible to all the groups that need to be involved?

The longevity of any solution is also dependent on how widely it is used and accepted within the organization. When a data integration tool is easy to understand and easy to use for different groups, especially for the business teams, its adoption will likely be faster.

While Java, Scala, and other coding languages are popular for analytics solutions, using a SQL-based language to process the data opens the solution up to a much larger set of users. It also reduces the stress of maintaining harder-to-find skill sets to support the solution. Providing end users with an intuitive UI, alongside command-line options for power users, will further increase development productivity for this broad user group.
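To give a sense of how compact SQL can be for this kind of processing, the sketch below expresses a filter plus a one-minute aggregation as a single query. It uses Python's built-in `sqlite3` module purely as a stand-in; a real-time ETL product would run a continuous, streaming dialect of SQL rather than a query over a static table. The `readings` table and its columns are hypothetical.

```python
import sqlite3

# In-memory database standing in for a stream of sensor readings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (device_id TEXT, ts INTEGER, temperature REAL)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("d1", 0, 20.0),
        ("d1", 30, 22.0),
        ("d1", 65, 30.0),
        ("d2", 10, None),  # missing measurement, filtered out below
        ("d2", 20, 18.0),
    ],
)

# Filter and aggregate per device and one-minute window in one statement.
QUERY = """
SELECT device_id,
       (ts / 60) * 60   AS window_start,
       AVG(temperature) AS avg_temp,
       COUNT(*)         AS n
FROM readings
WHERE temperature IS NOT NULL
GROUP BY device_id, ts / 60
ORDER BY device_id, window_start
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The same logic written by hand in a general-purpose language would need explicit loops and state for grouping, which is part of why SQL tends to be readable by a wider audience.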

5. Does it provide you the flexibility to move the data to new targets, especially in the cloud?

Technology requirements change. While you might have a specific endpoint like Apache Kudu in mind today, new requirements may dictate other technologies in the future.

The flexibility to supply raw or processed data to other on-premises and cloud targets simplifies and future-proofs your data architecture. Furthermore, for many data sources, especially database change data capture, reading the same data multiple times can add unacceptable overhead to source systems.

When looking at real-time ETL solutions, you should consider whether data read from a single source can be delivered to multiple targets simultaneously. These targets could include Kudu, as well as Kafka for real-time data distribution, and cloud technologies for elastic scalability.
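The read-once, deliver-to-many pattern can be sketched in a few lines. The example below is a simplified, single-threaded illustration in plain Python; `CountingSource`, `fan_out`, and the target callables are hypothetical and do not correspond to any Kudu, Kafka, or cloud client API.

```python
class CountingSource:
    """Wraps a list of change records and counts how often each is read."""

    def __init__(self, changes):
        self.changes = list(changes)
        self.reads = 0

    def __iter__(self):
        for change in self.changes:
            self.reads += 1
            yield change


def fan_out(source, targets):
    """Read each change from the source once and hand it to every target.

    `targets` is a list of callables; each could wrap a different
    destination (an analytics store, a message bus, a cloud service).
    """
    for change in source:
        for deliver in targets:
            deliver(change)


# Hypothetical usage: three in-memory lists stand in for real targets.
source = CountingSource([{"id": 1, "op": "insert"}, {"id": 2, "op": "update"}])
analytics_rows, bus_messages, cloud_objects = [], [], []
fan_out(
    source,
    [analytics_rows.append, bus_messages.append, cloud_objects.append],
)
```

Here the source is iterated a single time no matter how many targets are attached, which is the property that keeps the load on the source system constant as targets are added.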

When you feed pre-processed data from your high-velocity data sources to Kudu in real time using a secure, reliable, and easy-to-use solution, you can gain the maximum benefit from your fast analytics applications with the least amount of effort.

About the author: Irem Radzik leads product marketing at Striim. Before working for Striim, Irem was the Director of Product Marketing for the Oracle Cloud Integration product group. Irem has more than 18 years of product management and marketing experience in the enterprise software, financial services, and consulting industries, with a focus on data and application integration and business analytics technologies. She joined Oracle with the acquisition of GoldenGate Software in 2009. Before GoldenGate Software, Irem worked at Siebel Systems (now part of Oracle), TIBCO Software, and Enkata Technologies. She holds an M.B.A. degree from the University of Pennsylvania, Wharton School of Business.
