April 20, 2012

This Week’s Big Data Big Ten

Datanami Staff

Welcome to this week’s summary of the top ten stories in big data for the week ending April 20, 2012.

On tap this week we have news ranging from the edges of genomics research to the sad state of data protection, and from a new program that would make Stan Lee proud to advice on how to scrub dirty data. That’s in addition to some recent big data IPO activity and new concepts that are reshaping data-intensive enterprise computing.

Without further delay, we’ll begin our foray with #1…

 


Convey’s Genomics Powerhouse for German Researchers

Late this week Convey Computer, which keeps one foot rooted in research and high performance computing and the other in high-end verticals including financial services, announced that it would be bundling CLC bio’s enterprise genomics platform to aid researchers at the Helmholtz Centre for Infection Research in Germany.

The integrated solution will give the research center easier access to the advanced genomics platform, said Dr. Robert Geffers, who heads the center’s Genome Analytics Group. Geffers pointed to the large number of sequencing projects his group runs in parallel, saying that they need to be able to crunch the data quickly, but within a limited, efficient footprint.

Convey’s hybrid-core architecture pairs classic Intel processors with a coprocessor comprised of FPGAs. Particular algorithms — DNA sequence assembly, for example — are optimized and translated into code that’s loadable onto the FPGAs at runtime, which has been shown to greatly accelerate some performance-critical applications.
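
For readers unfamiliar with the hybrid-core idea, the sketch below illustrates the general offload pattern: run a hot kernel on an accelerator when one is present and fall back to the CPU otherwise. It is purely illustrative; the fpga_kernels module, the function names and the k-mer counting step (a common building block of sequence assembly) are hypothetical stand-ins, not Convey’s actual toolchain, which performs this translation at the compiler and loader level.

    # Illustrative only: runtime dispatch between an accelerated kernel and a
    # CPU fallback. The "fpga_kernels" module is a hypothetical stand-in.
    from collections import Counter

    def count_kmers_cpu(sequence, k=21):
        """Plain-Python k-mer counting, the portable fallback path."""
        return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

    def count_kmers(sequence, k=21):
        """Use the accelerated build of the kernel if one is available."""
        try:
            import fpga_kernels  # hypothetical accelerator binding
            return fpga_kernels.count_kmers(sequence, k)
        except ImportError:
            return count_kmers_cpu(sequence, k)

    if __name__ == "__main__":
        print(count_kmers("ACGTACGTACGT", k=4).most_common(3))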

With the integrated solution, Geffers says, Convey has solved those problems by “providing high performance, compact size, and easy integration with blades we already have.” He added: “Also, many of our analysis pipelines previously required using a command line and scripts that can be difficult for researchers and clinicians to work with. The CLC bio user interface eliminates these obstacles, putting an expansive suite of tools in the researchers’ hands.”

NEXT — The League of Analytic Superheroes? >>>


The League of Analytic Superheroes

What do you get when you combine graphic novels detailing the analytics exploits of mega-mind researchers and enterprise data scientists with two of the largest data-driven companies in the world?

Why, nothing less than the marketing genius of two of the world’s largest data-driven companies, of course.

Last month SAS and Teradata united forces to form the League of Analytic Superheroes, an effort they described as a “talent search for the best-of-the-best individuals, teams and companies who are masters of integrating analytic solutions from SAS and Teradata to produce earth-shattering insights and tangible business value.”

“The superhero initiative from SAS and Teradata serves an educational purpose – raising awareness of the powerful potential of data-driven analytics. Our world is drowning in data, but thirsty for actionable information,” said Diego Klabjan, PhD, Associate Professor of Industrial Engineering and Management Sciences and Director of the Master of Science in Analytics at Northwestern University.

“This program may catch the interest of college students considering a career in analytics, a field experiencing a shortage of talent.” Klabjan said Northwestern University’s Master of Science in Analytics program is one of the first of its kind, designed to prepare students to meet the global demand for data analytics practitioners.

The fruit of their labor is becoming apparent as the contest continues to thrive, with winners to be announced at SAS Global Forum in Orlando next week.

NEXT — Spelunking for an IPO >>>

 

Spelunking for an IPO

This week Splunk Inc., a company that focuses on “operational intelligence” for machine data (for the most part unstructured data that comes off everything from website transactions to sensors and application logs), made its much-anticipated IPO official.

The company, which we can soon refer to simply as SPLK, says its enterprise platform, which collects, monitors, indexes and analyzes the machine data generated by IT applications and infrastructure, has attracted around 3,700 customers since 2006.

They claim their success is due to the ability to handle machine data at scale, providing a record of all transactions, systems, applications, user activities, security threats and fraudulent activity—all sources of potentially valuable data that haven’t been tapped by most companies.
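
To make the category concrete, the short sketch below shows the kind of question an operational-intelligence tool answers over raw machine data: parse web-server-style log lines and count errors per host. It is a plain-Python illustration of the concept, not Splunk’s search language or indexing engine, and the log format shown is an assumption for the example.

    # Illustrative only: aggregate error counts per host from machine data.
    from collections import Counter

    log_lines = [
        "2012-04-19T10:01:02 host=web01 status=200 path=/home",
        "2012-04-19T10:01:03 host=web02 status=500 path=/checkout",
        "2012-04-19T10:01:04 host=web02 status=500 path=/checkout",
        "2012-04-19T10:01:05 host=web01 status=404 path=/favicon.ico",
    ]

    def parse(line):
        """Turn 'key=value' tokens into a dict; the timestamp is token zero."""
        fields = dict(tok.split("=", 1) for tok in line.split()[1:])
        fields["ts"] = line.split()[0]
        return fields

    events = [parse(line) for line in log_lines]
    errors_per_host = Counter(e["host"] for e in events if e["status"].startswith("5"))
    print(errors_per_host)  # e.g. Counter({'web02': 2})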

To put their offering in context, a large media outlet like NPR uses Splunk to measure the ebb and flow of its online listeners, evaluate the effectiveness of new programs and campaigns, optimize resource allocation and content delivery and more accurately account for revenue sharing and royalty payments.

In another use case of this type of technology, WhitePages uses Splunk to gather and analyze an extensive amount of operational data and user metrics without having to build or maintain a data warehouse. This data is leveraged on a constant basis to help WhitePages make better decisions.

The company announced the pricing of its initial public offering of 13,500,000 shares of common stock at a price to the public of $17.00 per share. A total of 12,507,278 shares are being offered by Splunk, and a total of 992,722 shares are being offered by selling stockholders.

 In addition, Splunk has granted the underwriters a 30-day option to purchase up to an additional 2,025,000 shares to cover over-allotments, if any. Splunk will not receive any proceeds from the sale of shares by the selling stockholders.
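
As a quick sanity check on those figures, the offering math works out as follows. These are gross numbers only; underwriting discounts and offering expenses are not accounted for here.

    # Back-of-the-envelope offering math from the figures above (gross only).
    price = 17.00
    shares_total = 13_500_000
    shares_from_splunk = 12_507_278
    shares_from_holders = 992_722
    overallotment = 2_025_000

    assert shares_from_splunk + shares_from_holders == shares_total
    print(f"Gross proceeds to Splunk: ${shares_from_splunk * price:,.0f}")        # ~$212.6M
    print(f"Gross to selling stockholders: ${shares_from_holders * price:,.0f}")  # ~$16.9M
    print(f"Potential over-allotment value: ${overallotment * price:,.0f}")       # ~$34.4M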

NEXT — The Sad State of Data Protection >>>

 

The Sad State of Data Protection

According to a recent report from data governance platform company Varonis, data protection is woefully inadequate for many enterprise users, especially those with compliance considerations.

Back in March the company surveyed over 200 individuals in the IT community, asking about their current data protection practices and how those practices correlate with their confidence that stored data is actually protected.

While it’s probably in their best interest to play up potential data governance flaws (so have that grain of salt ready), the company says that “While over 80% reported that they store data belonging to customers, vendors, and other business partners, only 26% reported being very confident that data stored within their organization is protected.”

As Varonis puts it, in the age of big data, businesses are creating, processing, storing, and sharing information at an alarming rate. A significant amount of that data is highly sensitive or confidential and should be properly safeguarded. It’s unnerving to think about the possibility of our own personal information sitting on servers, possibly unencrypted and open to everyone.

NEXT — Startup Taps IBM Talent…>>>

 

Analytics Startup Taps IBM Talent  

Illinois-based startup XtremeData, which got its start in 2005, has just acquired the talent of IBM information management expert Mike Lamble, who joins as president. In this new role, Lamble will guide the acceleration of go-to-market programs for the company’s data analytics technology offerings.

Lamble is coming on board with a company that claims a unique position in a sea of competitors. According to XtremeData, the past few years have seen the rise of numerous NoSQL solutions and a handful of parallel SQL databases.

Within this SQL ecosystem, they claim their dbX product is the only row-based parallel database that has been engineered from first principles. As the company states, “While others were slapping together federations of open-source database engines, XtremeData’s engineering team took a fresh, blank slate approach to the core database engine.”

The company says that when it got its start, they felt that fundamental re-thinking and engineering was necessary to fully leverage multi-core CPUs, vector engines, faster memory systems and higher network bandwidths—all concepts that led to dbX.

In essence, dbX is a natively parallel database engine with a core SQL execution engine that is vector-oriented to enable continuous acceleration benefits as the underlying technology evolves. In the early days, the company implemented plug-in acceleration via a patented FPGA module. As CPU technology continued to evolve rapidly, especially with Intel’s Nehalem processors, they shifted focus to acceleration via many-core CPUs. Today dbX is offered as a software-only product that is deployed in “commodity” data centers on large Linux clusters of x86 processors.
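
The row-at-a-time versus vector-oriented distinction is easiest to see in miniature. The sketch below, which uses NumPy purely as an illustration and has nothing to do with dbX’s internals, contrasts the two styles on a toy aggregation: the vectorized form processes whole columns in single calls, which is exactly the kind of work that benefits from wider SIMD units, more cores and faster memory.

    # Illustrative only: row-at-a-time vs. vector-oriented evaluation of
    # SELECT SUM(amount) WHERE region = 'EU' over a toy two-column table.
    import numpy as np

    n = 200_000
    rng = np.random.default_rng(0)
    region = rng.choice(np.array(["EU", "US", "APAC"]), size=n)
    amount = rng.random(n)

    def sum_row_at_a_time():
        """Tuple-oriented execution: one interpreted iteration per row."""
        total = 0.0
        for r, a in zip(region, amount):
            if r == "EU":
                total += a
        return total

    def sum_vectorized():
        """Vector-oriented execution: filter and sum run over whole columns."""
        return float(amount[region == "EU"].sum())

    print(sum_row_at_a_time(), sum_vectorized())  # same answer, very different speed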

Most recently Lamble was a practice leader in IBM’s analytics business unit. Prior to IBM, he was a VP at Greenplum, the EMC division focused on big data analytics. Before that he was a managing partner at Knightsbridge Solutions, a consultancy specializing in business intelligence, data warehousing, and data integration that was later acquired by HP.

“We have a new leader who very clearly understands the big data market,” said Ravi Chandran, XtremeData’s co-founder and CTO. “Mike is a veteran in the big data space and brings in-depth experience from Fortune 100 technology companies as well as a successful track record at start-ups. His industry knowledge and shared vision for our business equips Mike to lead XtremeData to the next level.”

NEXT — Putting Dirty Data to Shame… >>>

 

Pentaho Puts Dirty Data to Shame

Data quality issues are getting some much-needed airtime lately amid news about the need to expunge dirty data. This week Pentaho urged its business intelligence users to clean their dirty data with the addition of data quality features to its open source-spirited Pentaho Business Analytics platform.

To achieve this, the company partnered with data quality startup Human Inference to tightly integrate Human Inference’s EasyDQ platform with Pentaho Business Analytics for integrated data quality management.

According to the company, “Tightly integrated into the data integration capability of Pentaho Business Analytics and delivered via the cloud or on-premise, Human Inference enables customers to quickly build business intelligence applications with more accurate data, driving better and faster decisions.”

Data quality is a major step in Pentaho’s roadmap to build the future of analytics. The new data quality component includes the following capabilities, a couple of which are sketched in simplified form after the list:

  • Data Profiling
  • Name Validation, Standardization and Cleansing
  • Address Validation, Standardization and Cleansing
  • Email and Telephone Validation, Standardization and Cleansing
  • Duplicate Detection and Merge Duplicates
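
To give a sense of what a couple of the checks in that list amount to in practice, here is a simplified sketch of basic email validation and duplicate detection on lightly normalized names. It is a generic illustration, not Human Inference’s EasyDQ logic or the Pentaho plug-in’s API.

    # Illustrative only: simplified email validation and duplicate detection.
    import re
    from collections import defaultdict

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose

    records = [
        {"name": "Acme GmbH",  "email": "info@acme.example"},
        {"name": "ACME GmbH ", "email": "info@acme.example"},
        {"name": "Widget Co",  "email": "sales-at-widget.example"},  # invalid email
    ]

    def validate_email(rec):
        return bool(EMAIL_RE.match(rec["email"]))

    def dedupe_key(rec):
        """Normalize for matching: trim, lowercase, collapse whitespace."""
        return re.sub(r"\s+", " ", rec["name"].strip().lower())

    invalid = [r for r in records if not validate_email(r)]
    groups = defaultdict(list)
    for r in records:
        groups[dedupe_key(r)].append(r)
    duplicates = {k: v for k, v in groups.items() if len(v) > 1}

    print("invalid emails:", invalid)
    print("likely duplicates:", duplicates)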

The solution is available immediately and can be downloaded as a plug-in for the existing Pentaho Data Integration / Kettle releases 4.2.x and later.

“Dirty data remains a major barrier in providing accurate and timely business analytics to end users. Yet until now, the cost and complexity of existing data quality solutions meant that many companies simply could not integrate or include data quality as part of their overall business analytics operations,” said Barry Godthelp, VP Sales, Human Inference.

NEXT — More Analytics Moving to Amazon Cloud >>>


More Analytics Moving to Amazon Cloud

It’s getting easier for business intelligence and analytics users to free themselves from the constraints and costs of physical hardware. This week two more analytics companies announced that they were finding a second home in the Amazon cloud via the Amazon Web Services Marketplace.

This week Metamarkets, a big data analytics outfit aimed at online companies, announced its availability in the AWS Marketplace, which they say is valuable because it’s an online store that makes it easy for customers “to find, compare, and immediately start using the software and services they need to build software systems and products, and run their businesses.”

With Metamarkets, users can monitor key metrics, answer data-related questions for decision support, and discover new trends leveraging event-based data generated and stored on AWS just as they would on physical hardware, but with instance options and a pay-as-you-go model.

In addition, business intelligence company Jaspersoft announced that it too can now be found in the cloud. In its announcement, Jaspersoft execs said that, “With the launch of AWS Marketplace, Jaspersoft is now available as a pre-configured Amazon Machine Image (AMI), enabling customers to deploy it in the Amazon Elastic Compute Cloud (EC2) with just a few mouse clicks. Access to Jaspersoft gives customers the ability to instantly build advanced, scalable reporting capabilities to uncover new insights into their projects and businesses.”

By offering JasperReports Server within the AWS Marketplace, both companies give BI professionals powerful, yet lightweight reporting and design tools on a secure, scalable cloud infrastructure. Accessing secure interactive reports and charts through JasperReports Server on AWS requires no hardware or software installations, and runs completely within the AWS cloud environment for easy deployment.
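
For context, launching a pre-configured AMI like the ones described above comes down to a single API call against EC2. The sketch below uses the boto3 SDK with a placeholder image ID and an arbitrary instance size; the actual Jaspersoft and Metamarkets AMI identifiers are not specified here.

    # Illustrative only: launching a pre-configured AMI on EC2. The image ID
    # below is a placeholder, not an actual Jaspersoft or Metamarkets AMI.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId="ami-00000000",   # placeholder AMI ID
        InstanceType="m1.large",  # instance size is an arbitrary choice here
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])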

NEXT — Cloudera Gets Pervasive…>>>

 

Cloudera Gets Pervasive

This week Pervasive Software announced that it has worked with Cloudera to certify Pervasive RushAnalyzer 1.2.2 as a Cloudera Certified Technology.

Pervasive says that RushAnalyzer, designed for data scientists, business analysts and big data developers, provides end-to-end data access, transformation, analysis, visualization and delivery.

The key to RushAnalyzer, says Pervasive, is that it allows users to leverage these capabilities without getting their hands dirty writing code. This “power to the user” approach lets business users build and deploy their own advanced analytics solutions on multiple platforms, including Apache Hadoop. The company claims that, despite the specialized know-how such an undertaking normally requires, RushAnalyzer eliminates the need for specialized big data programming skills.

The Cloudera Certification process validates product compatibility with Cloudera’s Distribution Including Apache Hadoop (CDH), which provides a streamlined path for putting Apache Hadoop to work harnessing big data to solve business problems. Similarly, Pervasive RushAnalyzer provides a simplified route for users to get immediate value out of their Hadoop implementations.

 The Cloudera assessment addresses overall architecture, observance of the Apache Hadoop interface classification system, integration sufficiency, compliance with Cloudera support policies and requirements, and cluster capability using real-world workloads and micro-benchmarks.

“Our customers value the assurance provided by Cloudera’s certification in providing tested and validated products that work with Apache Hadoop to successfully tackle big data,” said Mike Hoskins, Pervasive CTO and general manager, Pervasive Big Data Products and Solutions. “Together, Pervasive and Cloudera are focused on enabling companies to rapidly discover new insights and deploy operational solutions leveraging big data at any scale, and this certification provides certainty in making investments in this platform.”

NEXT — Keeping a Lucid Imagination…>>>

 

Keeping a Lucid Imagination

Lucid Imagination deals in Apache Lucene/Solr open source enterprise search technology. The company provides free certified distributions, documentation, commercial-grade support, training, high-level consulting and value-added software for Lucene and Solr.

The small startup’s roster of customers includes household names like AT&T, Sears, Ford, Verizon, Cisco, Elsevier, Zappos, The Guardian, and Macy’s, among a number of others.

Built atop Apache Lucene/Solr 4.0-dev, LucidWorks Enterprise provides capabilities that enable search of both structured and unstructured data located across an organization. Lucid is one of only a handful of companies making direct use of the Apache Lucene/Solr project; according to Lucid Imagination, it has managed to nab eight of the 30 people who are core committers to the open source project.
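
To give a flavor of what sits underneath, Solr exposes search over HTTP, and a faceted query is just a GET request with a handful of parameters. The sketch below assumes a local Solr instance with a collection named “docs” and a “category” field; those names, and the use of the Python requests library, are illustrative and not part of LucidWorks Enterprise itself.

    # Illustrative only: a faceted query against a local Solr instance. The
    # collection name "docs" and the "category" field are assumptions.
    import requests

    resp = requests.get(
        "http://localhost:8983/solr/docs/select",
        params={
            "q": "storage AND failure",  # free-text query over indexed content
            "fq": "source:logs",         # filter query on a structured field
            "facet": "true",
            "facet.field": "category",   # count matches per category
            "rows": 10,
            "wt": "json",
        },
    )
    results = resp.json()
    print(results["response"]["numFound"])
    print(results["facet_counts"]["facet_fields"]["category"])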

The company says the latest release includes enhancements in scalability, availability, performance, security and ease-of-use.

LucidWorks Enterprise 2.1

LucidWorks Enterprise 2.1 uses Lucene/Solr as its base and, the company says, adds enterprise-class features and benefits that meet the demanding needs of organizations of any size. Enhancements provided in LucidWorks Enterprise 2.1 include:

Crawler Configuration

  • Ability to schedule external data source crawling
  • New connectors for high-speed Hadoop Distributed File System (HDFS), Twitter and Content Management Interoperability Services (CMIS)
  • New framework to build custom connectors

Business Rules Integration

  • Out-of-the-box integration with Drools Business Rules Management System (BRMS)
  • New framework for integrating other BRMSs
  • Ability to integrate business processes to influence search results and relevancy, modify facets, sort, and filter parameters, and divert search queries to a landing page. Business rules also allow businesses to modify the documents during the indexing process.

Dynamic Fields

  • Creation and configuration of fields directly from the user interface and REST API
  • Schema-free configuration

Advanced Memory Settings

  • Ability to manage buffer and cache settings directly from a secure admin user interface

Additional Important Enhancements

  • Configuration of real-time search directly from the user interface
  • Ability to crawl and index data at web scale through Nutch integration
  • Ability to promote documents to the top of the search results

LucidWorks Enterprise is available in two configurations: On-premise and Cloud (through LucidWorks Cloud). The company focuses on Apache Lucene/Solr, offering support, services and training to the open source community in addition to its commercial offering, LucidWorks Enterprise.

According to the company’s CEO, Paul Doscher, “Enterprise search provides one of the most powerful discovery tools available to organizations of any size. Layered on top of Lucene/Solr, LucidWorks adds enterprise-grade security, along with time-saving operational tools that streamline enterprise search configuration, deployment and operations. Offering these valuable search capabilities in the cloud increases an organization’s agility while allowing users to leverage untapped information assets in a cost-effective and timely manner. Hundreds of companies worldwide have turned to Lucid to gain these benefits quickly and easily through our unique cloud offering.”

NEXT — Jumping Through Clouds to Hadoop…>>>


Dell Partners to Support Hadoop, Cloud Initiatives

In an effort to extend to the Hadoop and big data cloud crowd, Dell announced its Emerging Solutions Ecosystem, which pulls a new host of partners into the fray, including Ubuntu vendor Canonical, cloud infrastructure management provider enStratus and OpenStack specialist Mirantis.

The company says the partnership program is designed to deliver complementary and interoperable best-of-breed hardware, software and services components as part of its emerging technology solutions, such as the Dell Apache Hadoop Solution and the Dell OpenStack-Powered Cloud Solution.

Dell has delivered specific solutions to market on the Hadoop big data framework and the OpenStack open source cloud platform. They claim that the Emerging Solutions Ecosystem “will enable expansion of the scope of those solutions and to continue to meet customer needs. As a result, customers can benefit even more from deploying these enhanced and validated solutions, which are optimized for maximum performance and scalability, and provide a single point of contact with whom to interact.”

The Emerging Solutions Ecosystem focuses on two key areas, big data and cloud enablement, both of which Dell has been looking to tie into, mostly through acquisitions and partnerships such as these.

Datanami