Startup Wrangles Machine Learning for Retailers
Last month, Belgium-based startup NGDATA announced that it had secured more funding to push forward with its Lily big data management platform. The young company claims to have found a way to blend machine learning and massive real-time data into an easy-to-digest format for viewing and reporting. Dozens of startups are making a similar claim these days, all built on the same basic premise: making big data actionable.
Lily is a data management platform combining storage, indexing and search with online, real-time usage tracking, audience analytics and a recommendation engine. Naturally, this might appeal to retailers, but the company could find some unique use cases outside of the obvious if it continues to grow.
In the tough competitive landscape for services like this, however, it takes a unique approach to stand out. To find out how NGDATA plans to tackle the future of big data needs in retail and beyond (not to mention carve out a meaningful niche in an already-noisy ecosystem) we talked with the company’s VP of product development, Steven Noels.
Your company describes Lily by stating that it’s a data management platform that combines storage, indexing/search and real-time tracking for analytics and recommendations…who is this message aimed at? What verticals are going to be most interested in your technology?
We’re positioning Lily as a Big Data management platform in four specific domains: news/media, retail/ecommerce, finance and telco. Generally speaking, Lily fits well in environments where “data meets usage”, i.e. where (possibly) lots of data is being interacted with by (possibly) large user audiences. This aspect of interactivity is crucial to our offer, as is staying close to the business value that data creates for our customers – i.e. enabling a direct financial return through up/cross-selling recommendations or interactive/real-time insights on customer behavior.
Can you give us the very high-level overview of what your platform does and how Hadoop fits into the picture?
Lily, at the lowest level, runs directly on top of Apache HBase, makes use of Hadoop, and integrates closely with Solr as an index/search service.
The Lily core offers a Big Data repository that integrates storage, indexing and search into a single, unified, and comprehensively managed platform. On top of that, we offer a high-level, user-friendly API and data model, plus an integration bridge that lets us feed and maintain a variety of internal and external components. The entire architecture is fully distributed, which means we can scale well horizontally, accommodating huge volumes of data and usage (traffic, users, concurrency).
At this core level, we already address a real enterprise development challenge: keeping a distributed (scalable) search index fully consistent with a BigTable-style datastore requires a lot of engineering effort on its own. Lily makes all of that a matter of setup, configuration, and rollout.
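To make the consistency problem concrete, here is a minimal single-process sketch in Python. It is not Lily's API and does not use HBase or Solr; the class and method names are hypothetical. It only illustrates the basic obligation: every write or delete against the store must also update the search index, or queries will return stale results.

```python
from collections import defaultdict


class IndexedRepository:
    """Toy record store with an inverted index kept in step on every change.

    Hypothetical illustration only: a real deployment would have a
    distributed datastore and a separate distributed search service.
    """

    def __init__(self):
        self.records = {}              # record_id -> text
        self.index = defaultdict(set)  # token -> set of record_ids

    @staticmethod
    def _tokens(text):
        # Naive tokenizer: lowercase and split on whitespace.
        return set(text.lower().split())

    def _unindex(self, record_id, text):
        # Remove every index entry that pointed at the old text.
        for token in self._tokens(text):
            ids = self.index.get(token)
            if ids is not None:
                ids.discard(record_id)
                if not ids:
                    del self.index[token]

    def put(self, record_id, text):
        # On update, drop stale index entries before adding new ones.
        old = self.records.get(record_id)
        if old is not None:
            self._unindex(record_id, old)
        self.records[record_id] = text
        for token in self._tokens(text):
            self.index[token].add(record_id)

    def delete(self, record_id):
        old = self.records.pop(record_id, None)
        if old is not None:
            self._unindex(record_id, old)

    def search(self, word):
        # Return matching record ids in a stable order.
        return sorted(self.index.get(word.lower(), set()))
```

In this toy version the store and the index live in one process, so staying consistent is easy. The engineering effort described above comes from doing the same thing when the two are separate distributed systems that can fail independently.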
Using these core services, we provide a variety of higher-level functional modules for batch/interactive processing (the former being Hadoop MapReduce-based), integration with machine learning technology (Apache Mahout), and a user profile and usage tracking store that we use as a driver for interactive analytics and recommendations.
Simply stated, we maintain a number of indexes built from tracking how users interact with data, and we use those indexes to shorten compute time for recommendations and pattern recognition.
We use HBase as the underlying datastore because it lets us keep data and track usage over extended periods of time, so users can build rich historical user/usage profiles.
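The idea of turning tracked interactions into indexes that shorten recommendation time can be sketched in a few lines of Python. This is an illustrative toy, not NGDATA's implementation: the `InteractionIndex` class, its methods, and the simple co-occurrence scoring are all assumptions. The point is that counts are updated as each interaction arrives, so a recommendation is a cheap lookup rather than a batch job over the full history.

```python
from collections import defaultdict


class InteractionIndex:
    """Toy in-memory index of user/item interactions (hypothetical names)."""

    def __init__(self):
        self.items_by_user = defaultdict(set)
        # co_counts[a][b] = number of users who interacted with both a and b
        self.co_counts = defaultdict(lambda: defaultdict(int))

    def track(self, user, item):
        """Record one interaction and update co-occurrence counts incrementally."""
        seen = self.items_by_user[user]
        if item in seen:
            return  # Repeat interactions are ignored in this sketch.
        for other in seen:
            self.co_counts[item][other] += 1
            self.co_counts[other][item] += 1
        seen.add(item)

    def recommend(self, user, k=3):
        """Return up to k unseen items that co-occur most with the user's items."""
        seen = self.items_by_user.get(user, set())
        scores = defaultdict(int)
        for item in seen:
            for other, count in self.co_counts[item].items():
                if other not in seen:
                    scores[other] += count
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [item for item, _ in ranked[:k]]


index = InteractionIndex()
for user, item in [
    ("ann", "shoes"), ("ann", "socks"),
    ("bob", "shoes"), ("bob", "socks"), ("bob", "laces"),
    ("cat", "shoes"),
]:
    index.track(user, item)

print(index.recommend("cat"))  # ['socks', 'laces']
```

A production system would also need persistence, time decay and a proper similarity measure; the sketch only shows why precomputed interaction indexes make the read path fast.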
What is involved with setting up a Lily cluster in an enterprise environment—and how does Lily integrate with existing environments?
Lily Enterprise offers tools and scripts for installation and management of multi-node setups, keeping all configs in sync over large clusters. It also provides a graphical environment for cluster administration, status and configuration. We integrate with a variety of enterprise systems, using ETL tools such as Pentaho Data Integration and Talend, but also support connectors for stream-based data ingestion and processing, e.g. Flume, Storm or Esper, and data analytics environments like Pig. Some of these are provided by us, others by partners.
You state that Lily is bringing big data technology and machine intelligence together—what does this mean for the average enterprise user who sees potential in the data available but isn’t sure what to do with it?
One aspect of critical importance is ease of use. We give enterprise users immediate access to a software product that doesn’t require them to spend months studying big data architecture books or exploring the continuously changing Big Data ecosystem at length.
Next, we have a partner model in which we combine the domain competences of our partners with our technology expertise. Domain specialists can ship solutions running on top of Big Data technology faster when building on top of Lily. We ship a system that takes care of the low-level plumbing and helps them become productive quickly.
One can make the argument that this is a very crowded space already. Considering the verticals you mentioned in your first answer, how can you stand out technologically from the pack?
I agree this is a crowded space, but it is a young market as well, so the crowdedness is related to the huge industry interest in finally being able to tackle the real data problems of every company out there. It’s the Cambrian explosion all over again: a lot of new products and rapid evolution.
A lot of big data ecosystem vendors are currently focusing on Hadoop, which obviously is an admirable platform but is often used to harvest the proverbial needle from a haystack of very flat, high-noise, low-signal data, in batch. A Hadoop run takes hours to process petabytes of low-value data into quality results. We strive to be the premier interactive big data vendor, hence our strategic choice for HBase – Lily has been designed to be used as an online data management system. By converting interactions into indexes and integrating machine learning, we are able to provide real-time recommendations – a unique and compelling technology offer.
Also, and quite importantly: we have been shipping the product and supporting real customers for almost a year now. We’re on the next phase of our roadmap: adding new features and improving existing ones.
Please provide us with one solid use case if you can so we can put your technology in better context.
Retailers often have quite complex and diverse back-end systems for managing product data, customer data, POS systems, stock and price.
These systems are typically interacted with in batch mode, in nightly ETL runs, because they can’t withstand the variable load that internet-facing systems are exposed to. Lily makes it possible to aggregate data from all these different sources, both back-end and front-end-derived, to scale with volume and usage, and to track how data (products) is being used (consulted, bought, appreciated) by users (customers). Tracking this usage behavior and translating it in real time into product recommendations can generate direct benefits for retailers.