The Critical Element for a Successful Digital Transformation? HTAP Powered by In-Memory Computing
Many of today’s digital transformation and omnichannel customer experience initiatives demand real-time analysis of data. For example, banks need to analyze transactions across their systems in real time to detect and prevent fraud. Healthcare providers must constantly analyze data being collected from hundreds or thousands of in-hospital and home-based patients to respond in real time to individual crises and detect possible disease outbreaks. Shipping companies are looking to continually analyze sensor data from delivery vehicles and vessels to predict maintenance requirements in order to increase utilization and reduce costs. And retailers want to analyze purchases in real time to update inventory plans, optimize recommendation engines, and even adjust pricing in real time.
To create these applications, most organizations need to rethink their application infrastructures and create architectures that support the required performance and scalability. Companies facing this challenge usually find that using an in-memory computing (IMC) platform to power hybrid transactional/analytical processing (HTAP), also known as hybrid operational/analytical processing (HOAP) or translytical processing, is the most cost-effective strategy for attaining the real-time performance and scalability these applications require. They may further find that the data in these systems must be augmented in real time with a subset of the data stored in their data lake in order to deliver a comprehensive view of their business processes.
The Challenge of HTAP
HTAP is the ability to support both transaction processing and analytics on the same operational dataset, without degrading the performance of either, in order to drive real-time decision making. However, implementing HTAP architectures for large datasets hasn’t been easy.
In the earliest days of computing, both transaction processing and analytics were performed on the same database. But as datasets grew larger, analytics queries could significantly slow transaction processing and bog down or even lock up applications. To address this problem, companies started deploying separate transactional (OLTP) and analytical (OLAP) databases. This meant data from the OLTP system had to be periodically (daily, weekly, etc.) extracted, transformed and loaded (ETL) into the OLAP system. This separation worked acceptably for several decades because real-time analytics on the operational database wasn’t required to remain competitive.
All that has changed. Today’s digital transformation and omnichannel customer experience initiatives are intended to drive real-time business decision making. They require the analysis of massive amounts of data in real time. The traditional bifurcated OLTP and OLAP architecture doesn’t work here because analyses run on the OLAP system are always performed on stale data due to the ETL lag. Separate OLTP and OLAP databases also require additional investment. To support them, IT must build and maintain separate architectures, typically on separate technology stacks. This results in separate hardware and software costs for each system, as well as higher costs for personnel to build and maintain them.
HTAP enables real-time analysis on the live operational dataset, but it requires unprecedented levels of speed and scalability. In certain cases, HTAP systems can even be extended to provide real-time analyses on both operational data and a subset of archived data from a data lake. For most companies, the only cost-effective way to achieve the performance and scale of an HTAP architecture is with an IMC platform.
In-Memory Computing Platform Capabilities
In-memory computing isn’t new, but until recently, the cost of memory limited its adoption. Today’s memory prices make IMC a practical option for companies across all industries. In addition, the availability of proven open source IMC platforms and of skilled personnel who understand how to leverage them further reduces the cost of implementing IMC technology.
Modern IMC platforms distribute data across the RAM of a cluster of commodity servers in order to process transactions and analyze data without the constant delays of continually reading and writing data from a disk-based database. These solutions may also be architected to collocate compute with the distributed data to reduce or eliminate the need to move data across the server network while enabling massively parallel processing across the cluster.
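The idea of spreading data across nodes and running the computation where the data lives can be sketched in a few lines of Python. This is a single-process illustration only, not any particular platform's API: the node names, the `owner` hash function, and the helper functions are all hypothetical, and plain dictionaries stand in for each server's RAM.

```python
from collections import defaultdict
from zlib import crc32

# Hypothetical cluster members; a real platform discovers these itself.
NODES = ["node-0", "node-1", "node-2"]


def owner(key: str) -> str:
    """Map a key to the node that holds it, using a stable hash."""
    return NODES[crc32(key.encode()) % len(NODES)]


def partition(records):
    """Spread (key, amount) records across per-node in-memory stores."""
    stores = defaultdict(dict)
    for key, amount in records:
        stores[owner(key)][key] = amount
    return stores


def local_sum(store):
    """Runs 'on the node': touches only the data that node already holds."""
    return sum(store.values())


def cluster_total(stores):
    """Only one small partial result per node crosses the 'network'."""
    return sum(local_sum(store) for store in stores.values())


records = [(f"txn-{i}", i * 10) for i in range(1, 7)]
stores = partition(records)
total = cluster_total(stores)
print(total)  # 210
```

The point of the design is that `local_sum` never sees another node's records, so the amount of data moved is proportional to the number of nodes rather than the number of records.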
All told, application performance may be increased by 1,000x or more compared with applications built on disk-based databases. The IMC platform can easily and cost-effectively be scaled to petabytes of in-memory data by adding servers to the cluster. The cluster automatically redistributes data to take advantage of the added RAM and CPU processing power of the additional nodes. The distributed architecture of an IMC platform also provides high availability when the data is replicated across nodes of the cluster.
Today’s open source IMC platforms offer a range of capabilities. These capabilities may include:
- An in-memory data grid (IMDG) that can be inserted between existing application and data layers with no rip-and-replace, providing massive performance and scalability gains to existing applications.
- An in-memory database (IMDB) that can be used to build modern, scalable, highly performant new applications.
- Support for distributed ACID transactions and ANSI SQL-99.
- A memory-centric architecture that enables users to balance infrastructure costs and application performance by keeping the full operational data set on disk while keeping all or only a subset of user-defined data in memory.
- Support for real-time analytics on joint data lake and operational data leveraging native integrations with solutions such as Apache Hadoop and Apache Spark.
- A continuous learning framework that can be deployed using integrated, fully distributed machine learning (ML) libraries that have been optimized for massively parallel processing.
- Native integrations to deep learning (DL) platforms.
- A stream processing platform for publishing and subscribing to streams of records, storing streams of records in a durable way, and processing streams of records as they occur.
- The ability to easily deploy the in-memory computing platform using container technologies such as Docker and manage the deployment using solutions such as Kubernetes.
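As an illustration of the memory-centric architecture in the list above, the following Python sketch keeps the full dataset in a slower tier and only a bounded hot subset in memory. It is a toy model under stated assumptions: the `MemoryCentricStore` class is hypothetical, a dictionary stands in for disk, and least-recently-used eviction is one possible policy, not necessarily what any given IMC platform uses.

```python
from collections import OrderedDict


class MemoryCentricStore:
    """Full dataset in a 'disk' dict; a bounded hot subset in 'memory'."""

    def __init__(self, memory_capacity: int):
        self.disk = {}                # stand-in for persistent storage
        self.memory = OrderedDict()   # hot subset, least recently used first
        self.memory_capacity = memory_capacity
        self.disk_reads = 0           # counts fallbacks to the slower tier

    def put(self, key, value):
        self.disk[key] = value        # write-through: disk always has everything
        self._cache(key, value)

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as recently used
            return self.memory[key]
        self.disk_reads += 1              # miss: read from the slower tier
        value = self.disk[key]
        self._cache(key, value)
        return value

    def _cache(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        if len(self.memory) > self.memory_capacity:
            self.memory.popitem(last=False)  # evict the least recently used


store = MemoryCentricStore(memory_capacity=2)
for k in ("a", "b", "c"):
    store.put(k, k.upper())
```

With a capacity of two, the three writes leave only "b" and "c" in memory, so reading "c" is served from RAM while reading "a" falls back to the disk tier. Raising `memory_capacity` trades higher infrastructure cost for fewer slow reads, which is the balance the list item describes.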
IMC and HTAP
An IMC platform can provide the performance and scalability to support HTAP, with the entire dataset – or a subset of data required for a particular use case – in RAM, ready for transactions and analytics processing. If compute is collocated with the data across the nodes of the cluster, performance will be much higher than if the data in RAM must constantly be moved across the network to a central server for processing. An HTAP architecture eliminates the need for separate transactional and analytical databases, creating the unified environment businesses need today for real-time performance at scale.
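The unified environment described above can be shown in miniature with Python's built-in `sqlite3` module and an in-memory database. This is only a single-process stand-in for a distributed IMC platform: the `orders` table and the `place_order` and `revenue_by_sku` helpers are hypothetical names chosen for illustration.

```python
import sqlite3

# One in-memory database serves both the transactional and analytical paths.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER, price REAL)"
)


def place_order(sku, qty, price):
    """Transactional path: commits on success, rolls back on an exception."""
    with conn:
        conn.execute(
            "INSERT INTO orders (sku, qty, price) VALUES (?, ?, ?)",
            (sku, qty, price),
        )


def revenue_by_sku():
    """Analytical path: aggregates the live rows, with no ETL step."""
    rows = conn.execute(
        "SELECT sku, SUM(qty * price) FROM orders GROUP BY sku ORDER BY sku"
    )
    return dict(rows.fetchall())


place_order("widget", 2, 5.0)
place_order("gadget", 1, 20.0)
before = revenue_by_sku()

place_order("widget", 1, 5.0)   # a new transaction arrives
after = revenue_by_sku()        # the very next query already reflects it
```

Because the analytical query reads the same rows the transaction just wrote, there is no copy to go stale. A real HTAP deployment would add what this sketch omits: partitioning across nodes, replication, and concurrency control so analytics do not slow transactions.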
Thanks to the growing adoption of IMC, Gartner has predicted that by 2020, IMC will be incorporated into most mainstream products. However, not all data analytics can or even should be performed on an IMC-powered HTAP system. Highly complex, ad hoc, long-running queries typically should be run in an OLAP system. Some analytics don’t drive real-time decision making, so running those analytics periodically is adequate. Still, IMC-powered HTAP provides businesses with a vital new ability to understand and react in real time to their business environment, leading to a more competitive business with real-time situational awareness.
ROI and Cost Savings
ROI is derived from the improvement in business outcomes resulting from the improved performance of the use cases. In addition, an HTAP solution built using an open source IMC platform can reduce startup and development costs, yet be backed by a company that provides enterprise-ready software versions and the support necessary to deploy and manage the platform in a mission-critical, production environment with strict SLAs.
A distributed in-memory computing system that can be deployed on commodity servers significantly reduces infrastructure costs versus traditional approaches built on proprietary, expensive hardware and software. This architecture, which eliminates the need for separate OLAP and OLTP systems, also reduces costs and complexity by eliminating the need for separate databases, hardware and skill sets for analytics versus operations. For use cases in which hot data resides in the operational system and cold data is stored in a data lake, smart integrations can power real-time business decision making based on combined data lake and operational datasets.
An IMC-powered HTAP architecture can provide another major cost saving for companies developing AI applications. Deep learning is very compute-intensive and is frequently run on specialized hardware. TensorFlow, an open source project originally created by Google, is a popular deep learning platform. However, companies using TensorFlow today typically create a data repository, often built on Hadoop, and then periodically ETL their operational data into the Hadoop system before loading it into TensorFlow for deep learning model training.
Some IMC platforms now offer native integration with TensorFlow, enabling companies to send the data in the in-memory computing platform directly to TensorFlow. This eliminates the need to create and maintain a separate analytical data storage architecture connected via ETL to the operational datastore. This can significantly enhance the effectiveness of the deep learning system while reducing the cost and complexity of operating it.
The competitive need to make real-time decisions based on analyzing massive amounts of data is only going to become more critical. If your company wasn’t born a digital enterprise or isn’t developing this capability today, there is a good chance you won’t survive. IMC-powered HTAP is by far the most cost-effective path to achieving the extreme speed and scalability necessary to support your digital transformation or omnichannel customer experience initiatives.
About the author: Abe Kleinfeld is the president and CEO of GridGain Systems, a provider of an in-memory database platform. Since joining in 2013, Abe has led the company through five years of triple-digit average annual growth, $29 million in Series B venture financing and numerous awards including being named to the Inc. 500 list of America’s fastest growing private companies for the past two years. Prior to GridGain, Abe worked at nCircle, which he helped lead to achieve $40 million in annual sales.