This Week’s Big Data Big Ten
It’s been another news-packed week in the world of big data in both scientific and enterprise realms.
Our top picks of the week (ending today, May 4) include data-intensive system news out of Australia, big genomics investments at key U.S. research centers, some multi-million dollar investments in analytics platforms, and a few other items of interest, including HPCC Systems’ move into Asia.
Without further delay, let’s launch into our first story, which showcases how high performance analytics is being leveraged to fight tough diseases:
Turning the Big Data Tide Against MS
This week, researchers at the State University of New York at Buffalo are using advanced analytics to comb through more than 2,000 genetic and environmental factors that might lead to multiple sclerosis (MS) symptoms.
The researchers at SUNY have been developing algorithms for big data containing genomic datasets to help them uncover critical factors that speed up disease progression in MS patients. This is a complex task, however, given the size and diversity of the data.
According to Dr. Murali Ramanathan, lead researcher at SUNY Buffalo, “Identifying common trends across massive amounts of MS data is a monumental task that is much like trying to shoot a speeding bullet out of the sky with another bullet.”
Using an IBM Netezza analytics appliance with software from IBM business partner, Revolution Analytics, researchers can now analyze all the disparate data in a matter of minutes.
On the research side, this marks the first time it has been possible to explore clinical and patient data to find hidden trends among MS patients by looking at factors such as gender, geography, ethnicity, diet, exercise, sun exposure, and living and working conditions.
That data, including medical records, lab results, MRI scans and patient surveys, arrives in various formats and sizes; previously, researchers had to spend days making it manageable before they could analyze it.
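The factor-screening idea described here can be illustrated with a minimal sketch. The field names and values below are hypothetical toy data, not the actual SUNY/IBM pipeline: after normalizing records into a common schema, each candidate factor is ranked by the strength of its correlation with a disease-progression score.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy, already-normalized patient records (hypothetical schema and values)
patients = [
    {"sun_exposure": 1.0, "exercise": 3.0, "progression": 2.1},
    {"sun_exposure": 4.0, "exercise": 1.0, "progression": 4.8},
    {"sun_exposure": 2.0, "exercise": 2.0, "progression": 3.0},
    {"sun_exposure": 5.0, "exercise": 0.5, "progression": 5.5},
]

progression = [p["progression"] for p in patients]
factors = ["sun_exposure", "exercise"]

# Rank candidate factors by the strength of their association with progression
scores = {f: pearson([p[f] for p in patients], progression) for f in factors}
ranked = sorted(factors, key=lambda f: abs(scores[f]), reverse=True)
```

The real analysis spans thousands of factors and far messier data, but the shape of the computation, screening many candidate variables against one outcome, is the same.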
SGI Lends Data-Intensive Might to Australian University
Queensland University of Technology was facing big data challenges and needed to advance its HPC research capabilities. With computational processing requirements growing at a rate of 30% every six months, the university knew it had to act soon or demand would quickly outpace capacity.
The university called in SGI, which added 54 new processor nodes as part of an ongoing and significant upgrade to the university’s high performance computing (HPC) system, boosting its data-intensive “big data” research capabilities.
The upgrade features 864 E5-2600-series Xeon cores; each node has 16 cores and 128GB of memory to handle the increased demand for data-intensive processing and for jobs that require a larger memory footprint. An SGI UV 10 big data system with 32 cores and 1TB of memory will also be integrated into the upgrade.
The SGI cluster will enable the university to continue its efforts in a number of research areas including Aviation and Airport environments, Biomedical Engineering, Smart Transport, Water and Environmental, Robotics, BioInformatics, Information Retrieval and Data Mining, Civil Engineering, Energy and Materials research. The goal of this initiative is to allow the university to sustain its distinguished track record in modelling, simulation and the analysis of critical problems faced by society.
“As a part of a longer-term plan, QUT is building a $230M Science and Engineering Centre (SEC) which will open in 2012 and house 500 academic staff and students,” said Chris Bridge, director of Information Technology Services at QUT. “These new developments, along with QUT’s expanding research profile, were the driving force behind this HPC upgrade.”
Opera Solutions Taps Visualization Expertise
Predictive analytics and machine learning company Opera Solutions announced this week that it will turn to Advanced Visual Systems for visualization tools to extend its capabilities.
The company plans to make use of Advanced Visual Systems’ (AVS) OpenViz, a data visualization API with high-performance in-memory processing capabilities. They say that this will not only integrate interactive graphics into Opera Solutions applications, but will also allow the visualization of machine output.
According to Steve Sukman, CEO of AVS, “Opera Solutions shares the AVS vision of next-generation analytics, where data integration, data processing and data presentation harmonically produce rapid and conclusive answers. We anticipate radical innovation in Big Data and data visualization from Opera Solutions’ visionary strategy.”
AVS was one of the early arrivals to the data visualization market with many early customers in scientific and high performance computing. They claim that since their inception in 1991, they’ve provided data visualization tools to over 2,000 corporations, software makers and research teams.
HPCC Systems Storms Asia
HPCC Systems has been a company on the warpath over the last year as it has raised funding and gone after new customers in the Hadoop world and beyond.
The company’s core technologies grew out of the need for LexisNexis to manage, sort, link, join and analyze billions of records within seconds.
As they describe it now, “HPCC Systems is an open source, data intensive supercomputer that has evolved for more than a decade, with enterprise customers who need to process large volumes of data in mission-critical 24/7 environments.”
The HPCC Systems technology has allowed the LexisNexis Risk Solutions business unit to grow to a $1.4 billion business in big data processing and analytics—a number they hope will grow with a series of new partnerships announced this week.
The company announced alliances with four partners that already have an established presence in Asia: Canonical, Comrise Consulting, Supermicro and L&T Infotech. Together they hope to serve the Asia-Pacific market and establish HPCC Systems as a premier resource for big data solutions.
As Flavio Villanustre, Vice President of Products and Infrastructure for HPCC Systems said, “We are aware of the investments being made in data centers in Asia. Launching HPCC Systems and its ecosystem into Asia will help customers navigate options for addressing large data sets, reduce overall infrastructure costs, and improve business agility and data insight.”
The company feels that large companies in Asia need new ways to process, analyze, and find links and associations in high volumes of complex data significantly faster and more accurately than current technology systems. They say their platform scales linearly from tens to thousands of nodes handling petabytes of data and supporting millions of transactions per minute—something needed in finance and other sectors of Asian economies.
UCSC to Become Big Data Cancer Research Hub
The University of California, Santa Cruz, has now completed a first step in building infrastructure to push the goals of personalized medicine a bit closer.
Called the Cancer Genomics Hub (CGHub), the new facility is a large-scale data repository and user portal for the National Cancer Institute’s cancer genome research programs.
CGHub’s initial “beta” release is providing cancer researchers with efficient access to a large and rapidly growing store of valuable biomedical data. The project is funded by the National Cancer Institute (NCI) through a $10.3 million subcontract with SAIC-Frederick Inc., the prime contractor for the National Laboratory for Cancer Research.
The group built CGHub to support all three major NCI cancer genome sequencing programs: The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI).
TCGA currently generates about 10 terabytes of data each month. For comparison, the Hubble Space Telescope amassed about 45 terabytes of data in its first 20 years of operation. TCGA’s output will increase tenfold or more over the next two years.
Over the next four years, if the project produces a terabyte of DNA and RNA data from each of more than 10,000 patients, it will have produced 10 petabytes of data (a petabyte is 1,000 terabytes). And TCGA is just the beginning of the data deluge, said David Haussler, director of the CGHub project at UCSC, noting that 10,000 cases is a small fraction of the 1.5 million new cancer cases diagnosed every year in the United States alone.
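The arithmetic behind those projections is simple enough to sketch directly, using only the figures quoted above:

```python
# Back-of-envelope projection using the figures quoted in the article
TB_PER_PATIENT = 1            # ~1 terabyte of DNA/RNA data per patient
PATIENT_COUNT = 10_000        # planned cohort over four years

total_tb = TB_PER_PATIENT * PATIENT_COUNT
total_pb = total_tb / 1_000   # a petabyte is 1,000 terabytes

# TCGA currently generates ~10 TB/month, expected to rise tenfold
current_monthly_tb = 10
projected_monthly_tb = current_monthly_tb * 10
```

At those rates the 10-petabyte figure follows directly, which is why the repository’s initial 5-petabyte design depends on the compression gains mentioned below.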
New data compression schemes are expected to reduce the total storage space needed, so the CGHub repository is designed initially to hold 5 petabytes and to allow further growth as needed. That is still a massive amount of data, and CGHub will need to accommodate transfers of extremely large data files.
Managed by the UCSC team, the CGHub computer system is located at the San Diego Supercomputer Center. It is connected by high-performance national research networks to major centers nationwide that are participating in these projects, including UCSC. The repository relies on an automated query and download interface for large-scale, high-speed use. It will eventually also include an interactive web-based interface to allow researchers to browse and query the system and download custom datasets.
Jaspersoft PaaSes BI to Cloud
This week business intelligence company Jaspersoft announced they have teamed with VMware to integrate Jaspersoft’s BI Platform with the Cloud Foundry open Platform-as-a-Service (PaaS).
Jaspersoft’s BI for PaaS integrates with Cloud Foundry to provide easy reporting to enable data-driven, cloud-based application development. The company said that cloud models can enable organizations to harness large amounts of data with greater flexibility, while supporting more data-driven applications.
Jaspersoft says that combining their own BI platform with Cloud Foundry provides developers with a scalable and cost-effective PaaS deployment method for enabling data-driven applications. This allows enterprises to deploy analytic applications to uncover patterns in their data, driving improved organizational performance while gaining better access to information for more informed decision-making.
With Cloud Foundry, VMware offers a competitive PaaS solution, built on open source technologies. Jaspersoft will offer an open source edition of its flagship Jaspersoft BI Suite, pre-configured for Cloud Foundry, which supports the ability to access relational databases like MySQL and big data services like MongoDB. Jaspersoft provides native access to various Big Data products, including MongoDB, for which it recently announced the BI connector.
Sequoia Pushes Millions Birst’s Way
It’s a great week at the California-based headquarters of business analytics company Birst. The company just scored a cool $26 million investment led by Sequoia Capital.
Birst says that with this infusion it will set to work, using the new funds to accelerate growth, ramp up product development, and expand into new markets.
Birst has grown considerably in the past year. The company more than doubled its revenues and increased its customer base by more than 40 percent, adding customers including Aruba Networks, en World Japan, Five9, Grupo Tress, Host Analytics, Motorola, oDesk, Saba, SunCap Financial, and Swann Insurance, among others.
In the last year, Birst unveiled multiple industry-leading products and services, including the industry’s first SaaS-based BI appliance, the first in-memory database optimized specifically for analytics, the first cloud-based mobile business intelligence SDK for the iPad, and support for Hadoop and Big Data analytics.
“This is an extraordinary time for us. We founded Birst to change the way the world used and interacted with BI and by pushing the envelope of possibility we are witnessing great success,” said Brad Peters, CEO and Co-Founder of Birst. “This investment furthers Birst’s ability to continue to drive innovation and expand our solution to new markets and new audiences. We are thrilled to have world-class investors such as Sequoia Capital by our side.”
Netuitive Expands Predictive Analytics Possibilities
This week predictive analytics company Netuitive shed light on their upcoming self-titled 6.0 platform release, which they say features the industry’s first open API to a predictive analytics platform.
According to the company, the goal was to extend usability and eliminate vendor lock-in by enabling extensible data integration and correlation capabilities required to achieve end-to-end application performance management (APM).
Netuitive says that one of the biggest challenges for APM is quickly correlating and extracting value from the vast and unmanageable amounts of data (e.g., business activity, customer experience, applications, infrastructure) collected from a plethora of specialized monitoring tools.
“The APM market is confronted with a big data problem. As our customers have progressed in the deployment of more agents to monitor their real-time performance at the business, application and infrastructure levels, they have been confronted with a deluge of data that is humanly impossible to correlate and interpret with traditional monitoring tools,” said Nicola Sanna, CEO of Netuitive.
On the new analytics platform, data is collected and normalized in Netuitive’s Performance Management Database (PMDB), analyzed by Netuitive’s Behavior Learning technology, and delivered as actionable outputs based on the analysis. Central to this capability are APIs enabling integration with configuration management databases, incident management, and programmatic administration of users, security, and policies.
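Netuitive’s actual Behavior Learning technology is proprietary, but the general idea of learning a baseline from recent metric history and flagging deviations can be sketched minimally. The window size and threshold below are hypothetical choices:

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=20, threshold=3.0):
    """Flag metric samples that deviate from a learned rolling baseline.

    A toy stand-in for behavior learning: keep a sliding window of
    recent values and flag any new sample more than `threshold`
    standard deviations away from the window mean.
    """
    history = deque(maxlen=window)

    def observe(value):
        anomalous = False
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalous = True
        history.append(value)
        return anomalous

    return observe

# A steady metric stream with one spike at the end
detector = make_detector(window=10, threshold=3.0)
stream = [100, 101, 99, 100, 102, 98, 100, 101, 500]
flags = [detector(v) for v in stream]
```

The appeal of this kind of approach for APM is that no per-metric thresholds need to be hand-configured; the baseline adapts to whatever each metric normally does.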
GridEngine Makes Progress to Tackle Big Data
This week Univa, the relatively new owner of Grid Engine, announced version 8.1 of the management software, which will soon be available in beta.
Grid Engine is a widely deployed, distributed resource management software platform used by enterprises and research organizations across the globe. Univa says Grid Engine handles workload management and big data integration while saving time and money through increased uptime and reduced total cost of ownership.
Corporations in the industries of Oil and Energy, Life Sciences and Biology, and Semiconductors rely on Univa Grid Engine when they need mission-critical computing capacity to model and solve complex problems.
The release, this time targeted at big data needs as well as its traditional bread-and-butter segments (high performance computing in research and enterprise), includes:
- Processor core and NUMA memory binding for jobs, which enables applications to run consistently and over 10% faster
- Job Classes describing how applications run in a cluster, slashing the time needed to onboard and manage workflows
- Resource maps that define how hardware and software resources are ordered and used in the cluster, helping to improve cluster throughput and utilization
- Improved job debugging and diagnostics, allowing administrators to discover issues in less time
- New support for Postgres database job spooling, which balances submission speed with reliability in high-volume clusters running many small jobs
- Documented and tested integrations with common MPI environments, saving valuable time since Univa has done the integration work
Gary Tyreman, president and CEO of Univa Corporation, said this week: “With the introduction of Univa Grid Engine Version 8.1, our third production release within the last 12 months, Univa has again proven its ability to support the largest sites in the world and to simultaneously deliver significant new products and features. The proof is in our execution for our customers. Univa provides seamless evolution for Grid Engine sites.”