You don’t have to look far to find someone with a “big data” story to tell, but if you ask Fortune 100 companies and the vendors that pitch to them, the speed component of that data tale is one worth telling.
The performance angle to analyzing massive data sets is a matter already being addressed by a robust big data ecosystem, says Matt Quinn, CTO of TIBCO Software. However, throw in the tricky topic of data diversity and variety, and moving big data at record speed requires a “rethink” of everything from hardware, to analytical frameworks, to the non-traditional databases that are reshaping the field.
Quite simply, big fast data is a problem of data variety as much as movement. This means the frameworks for ingesting and analyzing data are just as ripe for new development as the network itself.
Quinn claims that finding ways to let users scale out to accommodate massive datasets without adding operational complexity is an area of research and development at TIBCO that arose due to customer needs. He points out, however, that the index of what massive really means is not necessarily size-based. Rather, it deals with the variety issue. If the data is self-similar, scalability does not pose an overly complex problem. But when that data is complex and variable, users end up with consistency challenges.
Blending data consistency, management, and performance boosts creates a new layer of operational complexity, argues the TIBCO tech lead, and adds to the technical debt IT departments are already paying as they scale out and take on further management and other costs. “Even when you look at the different things Hadoop, TIBCO, and others have done with non-traditional approaches to databases, you still have to deal with the consistent problem to get results right. There have been many layers bolted onto frameworks like Hadoop, for example, but it makes things more brittle, not to mention slower.”
“What we’ve started to discover,” says Quinn, “is a whole lost art in the industry to analyze these massive streams of data and make sense of it all.” The trick is not so simple, of course, especially when speed is of the essence and those data streams do not fit into the nice structured systems that worked so well in the past.
The company’s R&D efforts now include approaches that they say can boost performance (while eliminating some technical debt overhead) by focusing on the network layer. At the heart of their work here is in-memory technology, which is nothing new, even though it’s being reworked to accommodate the needs of big data. According to the TIBCO CTO, “In-memory data grids (IMDGs) performed well in a variety of niches (CEP, event correlation, real-time aggregation, etc.) but we’re seeing a new trend that could reshape this model. As the price of memory has dropped and machines can handle terabytes in RAM, it is becoming more useful to move the data into direct access memory rather than using spindles or even Flash.”
Quinn says that the big trend with in-memory data grids, and an area of research for TIBCO that aligns with their ActiveSpaces product, is based in the network. “Unless you can store all the data on one physical machine, you now have to deal with the variance and latency introduced by the network. We’re trying to bring the data grids even closer to the network to reduce the time it takes to distribute information across large numbers of machines.” Quinn is hinting here at the company’s recently announced Faster than Light (FTL) switch, their first step in moving the messaging portion of their stack close to the network, essentially inside the switch, by embedding their FTL messaging technology in a partner-supplied switch. This means users can load their own application code into the switch, so the messaging layer can handle all of the messaging, hand it over to the application, and let the user avoid leaving the network. While this has clear appeal for TIBCO’s bread-and-butter financial services customers, Quinn sees the need for big fast data with FTL and related approaches moving far beyond this arena.
To highlight the changing needs of big data operations, he pointed to a recent example of a TIBCO client who said that if the data can’t be processed in 15 minutes or less, it’s junk. The size wouldn’t allow them to store it, even offline. What this means is that, as Quinn puts it, “The half-life of data usefulness is coming down, and this presents a new array of challenges.” The challenge is moving away from size and into the realm of high performance, not unlike some of the needs of HPC users in financial services, for example (an industry where TIBCO has had a foothold since its pre-IPO days).
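As a rough illustration of that 15-minute cutoff, the sketch below discards any event older than a fixed maximum age before it reaches the analysis step. It is not TIBCO's implementation; the event layout (a dict with a `time` field) and the names `MAX_AGE`, `still_useful`, and `drop_stale` are assumptions made up for this example.

```python
from datetime import datetime, timedelta

# Assumption: the cutoff mirrors the client's "15 minutes or less" rule.
MAX_AGE = timedelta(minutes=15)


def still_useful(event_time, now, max_age=MAX_AGE):
    """Return True if the event is no older than max_age at time `now`."""
    return now - event_time <= max_age


def drop_stale(events, now, max_age=MAX_AGE):
    """Keep only events that are still worth processing.

    `events` is a hypothetical list of dicts, each with a `time` field.
    """
    return [e for e in events if still_useful(e["time"], now, max_age)]


now = datetime(2012, 1, 1, 12, 0, 0)
events = [
    {"id": "a", "time": now - timedelta(minutes=2)},   # fresh
    {"id": "b", "time": now - timedelta(minutes=15)},  # exactly at the cutoff
    {"id": "c", "time": now - timedelta(minutes=40)},  # too old, dropped
]
fresh = drop_stale(events, now)
```

In a real stream the cutoff would be applied continuously as data arrives rather than to a list in memory, but the rule is the same: data past its useful life is never stored or analyzed.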
On that side note, the CTO sees the hardware and software needs of high performance computing being pushed into the business analytics space (and vice versa). “If you look at the HPC world, the challenges there have really been about taking a really big problem, splitting it up into many small things, then distributing that load across many machines,” says Quinn, pointing to Monte Carlo simulations as a good example. However, when one “looks at the last three to four trends with technologies like Hadoop, these same conclusions about harnessing massive parallelism are being reached,” albeit outside of what might look like HPC.
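The split-and-distribute pattern Quinn describes can be sketched with a textbook Monte Carlo job: estimating pi by sampling random points. This is only a minimal illustration, not anything from TIBCO or Hadoop. The function names (`count_hits`, `split_work`, `estimate_pi`) and the `mapper` parameter are invented here, and the default mapper runs the chunks serially.

```python
import math
import random


def count_hits(task):
    """One small unit of work: count random points inside the quarter circle."""
    seed, samples = task
    rng = random.Random(seed)  # independent, reproducible stream per chunk
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def split_work(total_samples, n_chunks, base_seed=0):
    """Split one large job into n_chunks small (seed, samples) tasks."""
    base, extra = divmod(total_samples, n_chunks)
    return [
        (base_seed + i, base + (1 if i < extra else 0))
        for i in range(n_chunks)
    ]


def estimate_pi(total_samples, n_chunks, mapper=map):
    """Run every chunk through `mapper`, then combine the partial counts."""
    tasks = split_work(total_samples, n_chunks)
    hits = sum(mapper(count_hits, tasks))
    return 4.0 * hits / total_samples


estimate = estimate_pi(200_000, 8)
print(f"estimate={estimate:.4f}  error={abs(estimate - math.pi):.4f}")
```

Because each chunk carries its own seed and needs no shared state, the same tasks could be handed to a worker pool, for instance by passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `mapper`. Only the small task descriptions and the partial counts move between workers, which is the compute-heavy, data-light profile attributed to HPC here.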
To put this in perspective, Quinn shared his view of the data-centric versus process-centric approaches to handling massive, complex data. On one end, there is the traditional HPC approach, which he says is less data-centric since it is essentially about passing small chunks of data to a large number of machines that crunch them from a compute point of view. This approach is heavily weighted toward algorithms, he adds. Conversely, the Hadoop challenge was less about processing and more about data size: in some cases the data was too bulky to move, so it made more sense to process it in place. The performance-concerned folks didn’t have to deal with data in the same way, since they were passing it around, whereas the Hadoop/data-centric camp didn’t have to deal with the sophisticated algorithms that made the compute-heavy HPC operations complicated. However, these spaces are quickly merging, he argues.
On the data-centric versus process-centric front, he notes that, “Interestingly enough, the statistical modeling and languages have come back again in the broader context of these massively distributed systems and it’s causing some problems with customers on the learning curve side.” In other words, the problems of HPC, on both the hardware and software side, are entering the practical world of big data as companies look beyond Hadoop for something that yields better performance but can still be tested easily using the “fail fast” approach. Quinn added that, “Outside of financial services and scientific research, the combination of these things hasn’t been seen yet. Pulling all these things together requires a lot of tooling to make it accessible and functional with the same fervor around user experience that’s been found in other areas.”
Quinn claims that while speed and performance have always been top priorities at most organizations, early adopters of what is now this big fast data segment were inspired by Wall Street’s move out of terminals and into the digital age. When these firms turned digital, their first issues revolved around the “simple” need to deliver the right information to the right location for traders to execute. Although this general problem hasn’t changed much, the big difference is that there’s not a person on the other end—there’s an application. This same model, minus the traders as the user, can be turned around to fit nearly all companies that require speed-to-decision on big, often convoluted, data sets.
Moves like this across an ever-broadening swarm of industries have produced an entirely new set of challenges wrapped in more general, classic IT problems (data integrity, movement, management, etc.). The ecosystem around big data is very noisy, but Quinn says it was the same with cloud and SOA in the past. He sees this noise, along with the open source solutions around big data, as a positive influence for the industry as a whole, since new solutions keep pushing the technology envelope. While he admits that too many cooks in the big data kitchen could cloud the message, he is confident that the marketplace, just as it always has, will sort out its winners and losers via the natural selection process of experimentation.