HadoopWorld Special: This Week’s Big Data Top Ten
Following Strata HadoopWorld this past week in New York City, we decided that there was no way to narrow the news down to a top five as we generally do each week. Instead, we culled more than 50 releases from the realm of big data down to these top ten.
There is, naturally, a Hadoop angle to most of these announcements given the focus of the year’s leading big data conference. This means that included in this list are MapR, Hortonworks, Tableau and many others who are integrating, connecting and expanding upon the open source big data giant.
Let’s get started with the first item—this one from the Hadoop performance side of the spectrum:
MapR and Google Set Hadoop TeraSort World Record
On stage yesterday at a Hadoop workshop in New York City, MapR Technologies, Inc. announced a new 1TB TeraSort benchmark record of 54 seconds, set using Google Compute Engine.
This record-setting TeraSort run broke the one-minute barrier with 1,003 servers, 4,012 cores and 1,003 disks. The prior documented record of 62 seconds was set by Yahoo running Apache Hadoop on 1,460 servers, 11,680 cores and 5,840 disks.
“To set this record in a virtualized cloud environment is a testament to Google Compute Engine’s high performance infrastructure,” said M.C. Srivas, CTO and cofounder of MapR Technologies. “This demonstrates the viability of cloud infrastructures for large-scale workloads.”
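To put the two runs quoted above side by side, here is a rough back-of-the-envelope sketch in Python. The run times and hardware counts come from the announcement. The derived metrics (core-seconds and MB sorted per core per second) and the assumption that 1TB equals 1,000,000 MB are ours, for illustration only. Cores on different hardware generations, or virtual versus physical cores, are not directly comparable, so treat the ratio as indicative rather than a formal benchmark result.

```python
# Figures as quoted in the announcement; the derived metrics are illustrative.
records = {
    "MapR on Google Compute Engine": {"seconds": 54, "servers": 1003, "cores": 4012, "disks": 1003},
    "Yahoo on Apache Hadoop": {"seconds": 62, "servers": 1460, "cores": 11680, "disks": 5840},
}


def core_seconds(run):
    """Total core time consumed: core count multiplied by wall-clock seconds."""
    return run["cores"] * run["seconds"]


def mb_per_core_second(run, dataset_mb=1_000_000):
    """MB sorted per core per second, assuming a 1TB dataset of 10^6 MB."""
    return dataset_mb / core_seconds(run)


mapr = records["MapR on Google Compute Engine"]
yahoo = records["Yahoo on Apache Hadoop"]

# How many times more core time the earlier record consumed.
core_time_ratio = core_seconds(yahoo) / core_seconds(mapr)

for name, run in records.items():
    print(f"{name}: {core_seconds(run):,} core-seconds, "
          f"{mb_per_core_second(run):.2f} MB per core per second")
print(f"Core-time ratio (Yahoo / MapR): {core_time_ratio:.2f}")
```

By this crude measure, the earlier record consumed a little over three times as much core time to sort the same terabyte.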
SAP Bundles Hadoop Power
SAP AG announced integration of Apache Hadoop into real-time data warehousing environments with a new “big data” bundle and go-to-market strategy with Cloudera, Hitachi Data Systems, Hortonworks, HP and IBM.
The offering is based on the flagship SAP HANA platform and combines the SAP Sybase IQ server, SAP Data Integrator software and SAP BusinessObjects business intelligence platforms. It provides a data warehousing product for real-time insights across massive data sets from various sources. The announcement was made at the O’Reilly Strata Conference + Hadoop World, being held October 23-25 as a co-located event in New York.
Where traditional databases once dominated enterprise data warehousing strategies, Hadoop is gaining traction among companies looking for an efficient and affordable way to store and process ever-increasing data volumes. However, companies struggle to integrate Hadoop with their business analytics environments and refined data warehousing practices. Together, the “big data” bundle from SAP and go-to-market partnerships with Hadoop vendors enable SAP to offer its customers a complete real-time data warehousing strategy that harnesses Hadoop’s potential with the speed of in-memory computing and columnar databases.
Mitsui Knowledge Industry has built a real-time analytic database that uses SAP HANA for complex, iterative algorithms against genome sequences preprocessed in Hadoop, reducing genome analysis from days to minutes. comScore uses Hadoop to process terabytes of data each day. The company loads results into its enterprise data warehouse based on SAP Sybase IQ, which can then be analyzed by thousands of comScore customers using self-service tools.
With SAP Data Integrator, organizations can read data from the Hadoop Distributed File System (HDFS) or Hive databases and load relevant data rapidly into SAP HANA or SAP Sybase IQ, helping ensure that BI users can continue to use their existing reporting and analytics tools. Furthermore, customers can federate queries across SAP Sybase IQ and Hadoop environments, or alternatively run MapReduce jobs across a SAP Sybase IQ MPP environment using built-in functionality. Lastly, SAP BusinessObjects BI users can query Hive environments, giving business analysts the ability to explore Hadoop environments directly.
Rackspace and Hortonworks Collaborate on OpenStack and Hadoop
Rackspace Hosting announced a strategic agreement with Hortonworks to deliver an enterprise-ready Hadoop platform that is easy to use in the cloud. Together, Rackspace and Hortonworks will focus on eliminating the complexities and time-consuming manual processes that are required for implementing a big data solution.
The joint effort will pursue an OpenStack-based Hadoop solution for the public and private cloud which can easily be deployed in minutes.
According to a recent Gartner report, big data will drive $232 billion in IT spending through 2016. Big data is enabling organizations to store, process and analyze very large sets of both structured and unstructured data in ways that were not possible or practical until now. However, implementing a big data solution can be complicated and typically requires a large investment in infrastructure, as well as specialized skills and tools that most organizations do not possess.
“Running Hadoop on your own is complex, which is why we’re excited about our development efforts with Hortonworks. We believe Hortonworks as a collaborator brings a substantial advantage in technology, services and experience that will clearly benefit customers,” said John Engates, CTO of Rackspace. “By joining forces, we intend to turn Hadoop into an on-demand service running on the Rackspace open cloud and in clusters on private cloud infrastructure in our data centers or the customer’s data center.”
Rackspace itself is an early adopter of big data and has been using Hadoop since 2008 for mission critical uses: the Emails & Apps division processes billions of emails a year to troubleshoot and diagnose customer issues; the billing department analyzes Cloud usage data daily to generate customer invoices.
Digital Reasoning and Tableau Partner to Analyze Unstructured Data
Digital Reasoning announced a partnership with Tableau Software. Synthesys, Digital Reasoning’s platform for unstructured data analytics, combined with Tableau Desktop 7.0, provides users in government, finance and other markets with a complete solution for addressing today’s fraud, risk, compliance and business opportunity challenges presented by massive amounts of unstructured text.
“We are excited to be working with Tableau on these big data analytics challenges within financial services and the government,” said Tim Estes, Digital Reasoning’s CEO.
By adding Synthesys’ unstructured data analytics, Tableau customers can now include new information sources including email, corporate communications, documents, research reports, social media, and a wide variety of other data sources. Synthesys’s approach to unstructured text analytics includes machine learning, patented algorithms, and adaptable software that learns by example. By combining the power of these two unique solutions, customers will have the ability to uncover fraud or risk that is buried in email, meet compliance obligations where the important data is unstructured, or identify new opportunities by automatically extracting the critical information found in web content, research reports, or company news.
“Tableau is happy to be working with Digital Reasoning to further our mission to help people see and understand data,” said Dan Jewett, Vice President of Product Management at Tableau Software.
Qubole and Simba Deliver SQL Access in Cloud-Based Hadoop
Simba Technologies Inc. announced that it has partnered with Qubole to provide ODBC access to the Qubole platform. Simba’s Big Data ODBC Driver technology will enable Qubole’s users to gain real-time, standard SQL and Hive Query Language (HiveQL) access directly to their Big Data using familiar Business Intelligence (BI) and analytics applications.
Qubole’s data platform is built on an optimized version of Apache Hadoop and Hive that enables users to work with structured and unstructured data all in the cloud.
“Qubole aims to simplify Big Data for organizations by storing it and making it easily accessible in the cloud,” said Ashish Thusoo, Qubole’s CEO. “We chose Simba’s Big Data ODBC Driver technology because it provides full access to Apache Hadoop and Hive data from any standard SQL business intelligence and analytics application.”
ODBC was developed in 1992 by Microsoft and Simba Technologies. The ODBC 3.52 specification – with which Simba’s Big Data ODBC Drivers are fully compliant – is the foundation for standards-based access from BI tools such as Alteryx, Microsoft Excel, Tableau, SAP BusinessObjects Crystal Reports, QlikView and countless others.
Unlike other data drivers for Hadoop/Hive, Simba’s Big Data ODBC Driver technology performs the added task of mapping advanced SQL functionality to HiveQL to provide users with integration between their favored SQL-based BI application and their Hadoop-based big data. Built upon the SimbaEngine SDK driver development framework – the industry’s leading ODBC toolkit for high-performance ODBC driver development – Simba’s cloud ODBC solution for Qubole enables industry-standard ODBC/SQL Big Data access for users and adds important functionality such as Unicode and 32- and 64-bit support for high-performance computing environments on all platforms.
“Qubole’s founders founded and authored Apache Hive,” said George Chow, Simba’s CTO, “built key parts of the Hadoop eco-system and brought Apache HBase to Facebook…Simba’s Big Data ODBC solution for Qubole allows its users full access to their data using the SQL-based tool of their choice.”
Metamarkets Open Sources Real-Time Data Store Druid
Metamarkets announced that it is open sourcing Druid, the streaming, real-time data store component of its analytics platform.
Delivering data analytics and interactive dashboards, the Metamarkets platform is built on a big data stack for processing, querying, and visualizing high volume, high frequency event streams. The data store component, Druid, enables analysis of streaming big data, and is architected as an in-memory, distributed columnar data store. With the Druid data store, Metamarkets’ Software-as-a-Service offering provides sub-second query response times across billions of records.
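As a rough illustration of what "in-memory columnar" means in the paragraph above, the following Python sketch stores each column as its own list and answers a filtered, grouped sum by scanning only the columns the query touches. This is a toy under our own assumptions, not Druid's actual design or API: the `ColumnStore` class, its method names and the sample event fields are all hypothetical.

```python
from collections import defaultdict


class ColumnStore:
    """Toy in-memory columnar table: one Python list per column (hypothetical)."""

    def __init__(self, columns):
        self.columns = {name: [] for name in columns}

    def append(self, row):
        """Split an incoming event (a dict) across the per-column lists."""
        for name, values in self.columns.items():
            values.append(row[name])

    def sum_by(self, group_col, value_col, where=None):
        """Sum value_col grouped by group_col.

        where is an optional (column, value) equality filter. Only the
        columns named in the query are scanned; the rest are never read.
        """
        row_count = len(self.columns[group_col])
        if where is None:
            selected = range(row_count)
        else:
            filter_col, wanted = where
            selected = [i for i, v in enumerate(self.columns[filter_col]) if v == wanted]
        groups = self.columns[group_col]
        values = self.columns[value_col]
        totals = defaultdict(int)
        for i in selected:
            totals[groups[i]] += values[i]
        return dict(totals)


# Sample advertising events; field names and values are made up.
events = ColumnStore(["publisher", "country", "impressions"])
for row in [
    {"publisher": "a.example", "country": "US", "impressions": 120},
    {"publisher": "b.example", "country": "US", "impressions": 80},
    {"publisher": "a.example", "country": "DE", "impressions": 40},
    {"publisher": "b.example", "country": "US", "impressions": 10},
]:
    events.append(row)

all_totals = events.sum_by("publisher", "impressions")
us_totals = events.sum_by("publisher", "impressions", where=("country", "US"))
print(all_totals)
print(us_totals)
```

The point of the layout is that an aggregate over one measure reads a single contiguous column rather than every field of every record, which is one reason columnar stores suit analytic queries over event streams.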
“Druid is the industry’s first open source, fully distributed analytical data store,” said Metamarkets’ CEO Mike Driscoll. “By sharing the Druid data store with the open source community, we feel we’re contributing a critical missing piece to the big data ecosystem.”
“When we started building Metamarkets’ analytics solution, we tried several commercially available data stores, but they could not deliver sub-second queries at the volumes seen by our online advertising customers — upwards of hundreds of billions of events per month. It became clear that Metamarkets needed to innovate and build our own data store,” said Lead Architect Eric Tschetter. “Now we are excited to see how the open source community will apply Druid to their own applications.”
Metamarkets has engaged with multiple large internet businesses, like Netflix and Riot Games, by providing early access to the code for evaluation purposes. Metamarkets anticipates that a complete open sourcing of Druid will help other organizations also solve their real-time data analysis and processing needs.
Actian, Attunity Tackle Real-Time Big Data Warehousing
Actian Corporation, creators of Vectorwise, and Attunity Ltd. announced today they have joined forces to create Attunity Replicate for Actian Vectorwise.
The platform is designed to provide high-performance, end-to-end data loading with quick time-to-value for Actian Vectorwise environments. The optimized replication solution is immediately available worldwide.
Attunity Replicate for Actian Vectorwise allows enterprise customers to quickly load data from heterogeneous data sources to Vectorwise data warehouses. Attunity Replicate maintains the most current data continuously via change data capture (CDC) technology – streaming changes in real-time from source databases to the Vectorwise data warehouse.
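The change data capture idea described above can be sketched in a few lines of Python: rather than reloading the whole source, an ordered stream of insert, update and delete events is replayed against the target. This is a minimal sketch under our own assumptions, not Attunity's implementation. The `apply_changes` function, the `(operation, key, row)` event shape and the sample rows are all hypothetical.

```python
def apply_changes(target, changes):
    """Replay an ordered stream of change events against a target table.

    target:  dict mapping primary key -> row dict (stands in for the warehouse)
    changes: iterable of (operation, key, row) tuples, in source commit order
    """
    for operation, key, row in changes:
        if operation in ("insert", "update"):
            # Merge changed columns over whatever the target already holds.
            target[key] = {**target.get(key, {}), **row}
        elif operation == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return target


# Target state after an initial full load (made-up data).
warehouse = {1: {"name": "Ada", "city": "London"}}

# Changes captured from the source since that load (made-up data).
change_log = [
    ("insert", 2, {"name": "Grace", "city": "New York"}),
    ("update", 1, {"city": "Cambridge"}),
    ("delete", 2, None),
]

apply_changes(warehouse, change_log)
print(warehouse)
```

Because only the changed rows travel to the target, the warehouse can stay current without repeated full loads, provided events are applied in the order they were committed at the source.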
“Attunity Replicate addresses the bottleneck of loading data into Big Data warehouses like Actian Vectorwise,” said Itamar Ankorion, Vice President of Business Development at Attunity, “and our partnership brings a solution to market that provides quick time-to-value.”
Attunity Replicate for Actian Vectorwise supports high-performance full loads and continuous change data capture; a Click-2-Replicate graphical user interface for designing and monitoring replication tasks; a wide range of data sources; automatic schema generation and implementation of metadata changes on the target; and an integrated change audit trail option that contains all change events and can be easily retrieved for access auditing and security.
Continuuity Introduces Big Data Application Fabric
Delivered as a big data application Platform as a Service (PaaS), Continuuity takes advantage of the elastic scalability, ability to abstract infrastructure complexity, and agility possible in the cloud. Packaged in three flavors – Single-node Edition, Private Cloud Edition and Public Cloud Edition – Continuuity provides a cloud-based application runtime platform and a suite of one-of-a-kind developer tools.
It allows developers and companies to focus on the development of Big Data applications. Continuuity features visually rich UIs, simple APIs, push-button deployment from the developer’s local machine to the cloud, and dynamic scalability to meet application demands.
“Continuuity’s Big Data application fabric makes the middle tier relevant to Big Data,” said Tony Baer, principal analyst at Ovum. “It applies the same principles that allowed Java developers to scale enterprise applications to the multi-tiered, highly distributed environment of the Web. It allows Java developers to do what they do best: build repeatable applications that allow enterprises to incorporate Big Data into their operations.”
The Continuuity AppFabric is built on top of existing open source Hadoop infrastructure components while shielding developers from their complexity. Continuuity provides tools and building blocks to create applications quickly and deploy them to the Continuuity Big Data AppFabric. The AppFabric is available as a hosted cloud platform or can be integrated with an existing Hadoop/HBase installation.
Continuuity Developer Suite consists of a free, downloadable, fully featured single node edition of the Continuuity AppFabric and a Software Development Kit (SDK). It allows developers to build applications in their IDE, run, test and debug them on their local machines, and when ready, “push to cloud” with a single click.
Continuuity Private Cloud Edition, currently available in private beta, enables developers to deploy their Big Data apps to a single tenant, private cloud PaaS. Continuuity Public Cloud Edition, targeted for availability next year, allows them to deploy their Big Data apps to a hosted multi-tenant, self-service cloud PaaS that will support capacity on demand.
Once applications are deployed, Continuuity’s user interface gives insights into applications as they run in the cloud, indicating all activity and diagnosing any problems. An intuitive dashboard provides high-level aggregate metrics across all applications, while drill-down capabilities enable developers to visualize and understand the behavior of their applications. Continuuity provides real-time information about applications, allowing users to dynamically scale them at the touch of a button. For example, the UI will highlight insufficient application resources and the user can click a “+” button to automatically add the needed resources without taking the application offline or having to think about the underlying infrastructure.
“Until now, the barrier to building Big Data applications was insanely high. With the launch of Continuuity, we’re democratizing Big Data application development and making the developer’s life better across the entire application development lifecycle. Our platform will unleash a huge wave of developers building Big Data apps,” said Todd Papaioannou, co-founder and CEO of Continuuity.
Talend to Support NoSQL Databases
Talend announced the addition of support for widely-used and deployed NoSQL databases in Talend’s big data integration solutions, Talend Open Studio for Big Data and Talend Platform for Big Data.
Talend already provides graphical components that enable configuration for leading big data technologies such as Apache Hadoop’s HDFS or Apache Hive to provide random, real-time read/write, column-oriented access to big data. With support for leading NoSQL technologies including Cassandra, HBase and MongoDB, Talend enables organizations to attain high levels of big data integration performance and inclusiveness. This support also extends the scope of Talend’s integration across the spectrum of transactional, operational and analytic data sources found in enterprise big data environments.
“This integration emphasizes Talend’s commitment to NoSQL technologies, which are becoming increasingly critical in big data deployments,” said Fabrice Bonan, co-founder and chief technical officer, Talend. “NoSQL databases give organizations the advantages of scale and flexibility of data structures, and are a good option for managing large amounts of data where the relationship between the data elements is less important.”
Built on Talend’s open source integration technology, Talend Open Studio for Big Data is a versatile open source platform for big data integration that natively supports Apache Hadoop, including connectors for Hadoop Distributed File System (HDFS), HCatalog, Hive, Oozie, Pig and Sqoop – in addition to the 450+ connectors included natively in the solution. Its code generation approach helps organizations of all sizes to reap the rewards of the vast amounts of data they have collected over time, at a fraction of the cost of traditional solutions.
NGDATA’s Lily CDH4-Certified
NGDATA announced that its flagship software platform Lily is now certified on CDH4 (Cloudera’s Distribution Including Apache Hadoop Version 4). Lily combines internal and external structured and unstructured data, drawn from point-of-sale (POS) systems, enterprise applications, social media sites and more, into a single platform, making consumer insights actionable for enterprises. Lily uses machine learning to generate a precise snapshot of consumer preferences and behaviors in real time.
Today, more than half of Fortune 50 companies run open source Apache Hadoop based on Cloudera. CDH4 is a 100-percent open source distribution that combines Apache Hadoop with other open source applications within the Hadoop stack to deliver advanced, enterprise-grade features.
“It is critical to really know the customer and be able to anticipate the products consumers will need based on their purchase history and behavior across many channels, including social media,” said Luc Burgelman, CEO of NGDATA. “We’re excited to further develop our applications on the CDH4 framework to provide businesses with a comprehensive Big Data solution that provides them with a way to more effectively reach customers and drive profits.”
The Cloudera Certified Technology program, which NGDATA has joined, makes it simpler for Apache Hadoop technology buyers to purchase the right cluster components and software applications to extract the most value from their data. Cloudera Certified Technologies have been tested and validated to use supported APIs and to comply with Cloudera development guidelines for integration with Apache Hadoop.
“The Cloudera Certified Technology program is designed to make those choices easy and reliable,” said Tim Stevens, Vice President of Corporate and Business Development, Cloudera. “We’re committed to helping enterprises achieve the most from their Big Data initiatives, and we’re pleased that NGDATA has completed certification of Lily on CDH4.”