HPCC Systems Intros Machine Learning Beta
For those who aren’t aware, HPCC Systems is the commercially oriented competitor to Hadoop that was developed in-house at LexisNexis Risk Solutions to support its own mission-critical decision-making systems. The company commercialized the platform under the name HPCC Systems, and since last year it has raised a great many eyebrows, not to mention revenue.
This week HPCC Systems announced the beta version of a new set of machine learning and matrix processing algorithms directed at enterprise and research data scientists.
The company says these algorithms will cover a large number of machine learning needs for developers with business intelligence and predictive analysis problems. The machine learning library, which will take advantage of full parallelization within HPCC Systems’ architecture, will support supervised and unsupervised learning, document and text analysis, statistics and probability, and general inductive inference problems.
As HPCC Systems describes it, the machine learning beta project is designed to create an extensible library of fully parallel machine learning routines: “the early stages of a bottom up implementation of a set of algorithms which are easy to use and efficient to execute. This library leverages the distributed nature of the HPCC Systems architecture, providing for extreme scalability to both, the high level implementation of the machine learning algorithms and the underlying matrix algebra library, extensible to tens of thousands of features on billions of training examples.”
Major machine learning algorithms have been worked into the library release, including the popular k-means for clustering, naïve Bayes classifiers, ordinary linear regression and linear correlation, and association routines to perform association analysis and pattern prediction. As an HPCC Systems document stated this week, “The document tokenization and text classifiers included, with n-gram extraction and analysis, provide the basis to perform statistical grammar inference based natural language processing. Various methods in univariate statistics are also available.”
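For readers unfamiliar with the terms, the following is a minimal sketch of what document tokenization and n-gram extraction involve. It is written in plain, single-machine Python purely for illustration; it is not the HPCC Systems library or its ECL interface, and the function names (`tokenize`, `ngrams`, `ngram_counts`) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(documents, n):
    """Count how often each n-gram occurs across a collection of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(tokenize(doc), n))
    return counts


if __name__ == "__main__":
    docs = ["Big data, big claims", "Big data tools"]
    # Print the two-word sequences (bigrams) and their frequencies.
    for gram, freq in ngram_counts(docs, 2).most_common():
        print(" ".join(gram), freq)
```

Frequency tables of this kind are a typical input to statistical text classifiers; a parallel implementation would distribute the counting step across the nodes of a cluster rather than looping over documents on one machine.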
According to Armando Escalante, Senior VP and CTO of HPCC Systems (part of LexisNexis Risk Solutions), “With this tool users don’t have to summarize the data. We’ve seen others claim that they are doing machine learning on big data when, in fact, they are doing machine learning on summarized big data, using legacy and traditional tools.”
Escalante continued, noting that another powerful element of the ML library is that it can leverage both the ECL language, which will cut down on the number of Java developers needed, and the parallelization power of the HPCC Systems platform.