Spring Strata 2016 Coverage


The Real-Time Rise of Apache Kafka

Shiny new objects are easy to find in the big data space. So when the industry’s attention shifted towards processing streams of data in real time–as opposed to batch-style processing that was popular with first-generation Hadoop–we saw dozens of promising new technologies pop up seemingly overnight. One of them was Apache Kafka.

The interesting thing is that Kafka wasn’t actually new. Jay Kreps started writing the software, which serves as a messaging layer for moving data, when he worked at LinkedIn in 2008, and the software was contributed to open source in 2011. Read more…

Feature Articles from Spring Strata 2016

Reporter’s Notebook: 6 Key Takeaways from Strata + Hadoop World


The big data ecosystem was on full display at last week’s Strata + Hadoop World conference in San Jose. At the ripe old age of 10, Hadoop is still the driving force, but newer frameworks like Spark and Kafka are gaining steam. Here are some of the top trends your Datanami editor pulled from the show based on observations and discussions with attendees and vendors. Read more…

Cutting On Random Digital Mutations and Peak Hadoop


In a wide-ranging Strata + Hadoop World talk on Wednesday that reminds us why we like Doug Cutting so much, the father of Hadoop riffed on the evolution of big data tech, the power of open source, the promise of Flink, and the possibility of “peak Hadoop” at the ripe old age of 10. Read more…

Elastic Gives Search Engine a Graph Option


Elastic today announced that it’s added a graph query engine to Elasticsearch. Users of the search engine now have the option of using their search indexes as the basis for conducting graph analyses. The new option will make it relatively easy for customers to conduct big data analysis for use cases such as fraud detection and product recommendations. Read more…

Apache Flink Creators Get $6M to Simplify Stream Processing


Real-time stream processing is one of the hottest topics this week at Strata + Hadoop World, and one of the new frameworks turning heads is Apache Flink. Developed by the German company data Artisans, Flink is unique in that it aims to simplify the big data analytics stack with “streaming first” Read more…

For Data Scientists, What’s in a Name Really Matters


Shakespeare once pondered the nature of names, pointing out that “a rose by any other name would smell as sweet.” For data scientists, the meaning behind the title is not just an epistemological exercise, but a practical problem that has consequences upon that delicate dance between employer and employee.

The data scientist shortage is having all kinds of impacts on how organizations approach big data projects. Read more…

Finding Long-Term Solutions to the Data Scientist Shortage


As we learned in the first part of this series, the gap between demand for skilled data scientists and supply is driving salaries north of $200,000 in some areas of the country. If big data analytics is to be democratized, steps must be taken to ensure that this short-term misalignment doesn’t turn into a long-term problem. Read more…

ODPi Defines Hadoop Runtime Spec; Operations Up Next


Today the ODPi issued the first set of documents that describes a standard distribution of basic runtime components for Hadoop, including YARN, HDFS, and MapReduce. Going forward, the organization is preparing a management specification for Hadoop as it considers which Hadoop problem area it will tackle next.

The ODPi was founded a year ago on the eve of the Spring Strata + Hadoop World conference as the Open Data Platform initiative to help rein in some of the complexity that’s impacting Hadoop distributors, software vendors, and users. Read more…

Tracking the Data Science Talent Gap


If your company is looking to hire a data scientist right now, good luck. Five years after Harvard Business Review first shone the spotlight on the data scientist shortage, the gap between data science supply and demand remains substantial. In fact, the gap may be getting bigger.

How big is the data science skills gap? Read more…

News in Brief from Spring Strata 2016

Spark Leads Big Data Boom, Researcher Says


The global big data market is poised to explode over the next decade, according to a new forecast, topping an estimated $92 billion by 2026 as new streaming analytics technologies emerge.

Market researcher Wikibon said this week it expects the global demand for big data services to grow at a hefty 14.5 percent annual rate over the next decade. Read more…

AI Services Firm Bonsai Wins Strata Startup Showcase


An artificial intelligence (AI) startup out of Berkeley, California called Bonsai won the Startup Showcase at Strata + Hadoop World today. The second and third-place winners were also announced, as was the winner of the audience choice award.

Bonsai has created a cloud-based platform where users can build and deploy AI services, as well as a marketplace where the services can be bought and sold. Read more…

Architecting Immediacy: The Design of a High-Performance Portable Wrangling Engine


At Strata + Hadoop World San Jose this week, I will present with my fellow Trifacta colleague, co-founder Joe Hellerstein, a session entitled “Architecting immediacy: The design of a high-performance, portable wrangling engine.”

A big part of our session will be discussing our new Photon Compute Framework, an enhancement at the core of Trifacta’s data wrangling interface. Read more…

Upgrades Aid Access to Legacy Data


The latest release of a No+SQL database management platform adds integration capabilities for legacy COBOL and Btrieve systems designed to allow users to update the data management engine underneath their existing applications.

Noting that a significant number of financial and other users continue to rely on legacy systems based on COBOL and Btrieve transactional database software, database specialist FairCom Corp. Read more…

MemSQL Pushes HTAP Ball Forward


MemSQL used the second full day of the Strata + Hadoop World conference to launch a new version of its distributed SQL database that pushes forward its hybrid transactional/analytical processing (HTAP) strategy, which is gaining steam across the industry as a blended form of computing.

MemSQL is part of a new class of in-memory, horizontally scalable, relational databases that are gaining momentum for the capability to ingest and analyze large amounts of data in near real time. Read more…

Deploying Hadoop on User Namespace Containers


Hadoop is increasingly moving to the cloud, with Gartner reporting that over 50% of companies are considering a cloud-only or hybrid cloud solution for Big Data. Altiscale has been offering a high-performance, secure, multi-tenant cloud solution since 2014, with its multitenancy and performance capabilities driven by the use of namespaced Docker containers. Read more…

Platfora Seeds Big Data Future in Openness


Platfora launched its end-to-end analytics application for Hadoop when the only other option was to build your own. To that end, Big Data Discovery has everything you need. But with today’s update to the tool–issued on the second day of the Strata + Hadoop World conference–Platfora is opening up the kimono a bit more in an effort to better integrate with popular tools in the ecosystem, namely Tableau and Spark SQL. Read more…

Transactional Streaming? You Need State


The distinction between traditional operational systems and event/stream processing has begun to blur. Stream-oriented approaches offer novel ways to build applications that yesterday would have used a more traditional stack, such as LAMP or something similar. Rather than have monolithic clients fetch, process and update data over a network, developers are building pipelines that push data through fixed processing. Read more…

BI on Hadoop–What Are Your Options?


In the era of RDBMS and modern data warehouses, business intelligence was mostly a solved problem. Any reasonably advanced tool would work with any reasonable database, and the only real work was deciding what to collect and how to present it. However, the rise of big data and its associated technologies has forced the market to solve all these old problems all over again, and we’re now left with a proliferation of software that can be difficult to differentiate. Read more…

Resolving Hadoop’s Storage Gap


Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds.

With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. Read more…

Streaming Architecture–Why Flow Instead of State?


The way that computing is done is changing dramatically. Instead of a program with a finite input, we now have programs with infinite streams as inputs. Why does this matter, and why is the change happening now?

This matters because life doesn’t happen in neatly defined batches. Neither should your code. Read more…

Distributed Stream Processing with Apache Kafka


We’re only three months into 2016, but it has been an exciting year in open source and big data. With a marked jump in growth, usage and queries on Apache Kafka (Redmonk), the demand for engineering and DevOps jobs requiring Kafka talent is creating huge demand for training and skills development, as users look to leverage new features and create new deployments. Read more…

Machine-Learning Platform Certified For Cloudera


In the run up to next week’s Hadoop confab in Silicon Valley, vendors are releasing a flock of automation and other tools aimed at beefing up the mainstream data processing framework. Among them is an attempt to incorporate data science with a leading Hadoop distribution via a machine-learning approach.

Boston-based data science automation specialist DataRobot said this week its machine-learning platform designed to fill the data science skills gap has been certified on Cloudera Enterprise 5. Read more…


This Just In from Spring Strata 2016