The 10 Commandments of Business Intelligence in Big Data
Organizations today don’t use previous-generation architectures to store their big data. Why would they use previous-generation BI tools for big data analysis? When evaluating BI tools for your organization, there are 10 “Commandments” you should live by.
First Commandment: Thou Shalt Not Move Big Data
Moving Big Data is expensive: it is big, after all, so physics is against you if you need to load it up and move it. Avoid extracting data out into data marts and cubes, because “extract” means moving, and it creates big-data-sized problems in maintenance, network performance, and additional CPU, all spent on two copies that are logically the same. Pushing BI down to the lower layers, so that it runs where the data lives, is what motivated Big Data in the first place.
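To make the difference between extracting and pushing down concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the big data store, and the `events` table and its columns are hypothetical. Both approaches reach the same answer, but the extract approach ships every row to the client first, while the pushdown approach ships only the aggregated result.

```python
import sqlite3

# In-memory SQLite stands in for the data store (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("east" if i % 2 == 0 else "west", i) for i in range(10_000)],
)

# Extract approach: move every row to the client, then aggregate there.
extracted = conn.execute("SELECT region, amount FROM events").fetchall()
client_totals = {}
for region, amount in extracted:
    client_totals[region] = client_totals.get(region, 0) + amount

# Pushdown approach: aggregate where the data lives, move only the result.
pushed = conn.execute(
    "SELECT region, SUM(amount) FROM events GROUP BY region"
).fetchall()
pushed_totals = dict(pushed)

print("rows moved by extract: ", len(extracted))
print("rows moved by pushdown:", len(pushed))
print("same answer:", client_totals == pushed_totals)
```

At this toy size the gap is 10,000 rows versus 2; the ratio only gets worse as the table grows.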
Second Commandment: Thou Shalt Not Steal!…Or Violate Corporate Security Policy
Security’s not optional. The sadly regular drumbeat of data breaches shows it’s not easy, either. Look for BI tools that can leverage the security model that’s already in place. Big Data can make this easier, with unified security systems like Ranger, Sentry and Knox; even MongoDB has an amazing security architecture now. All these models allow you to plug right in, propagate user information all the way up to the application layer, and enforce authorization for each visualization, and for the data lineage associated with it, along the way. Security as a service: use it.
Third Commandment: Thou Shalt Not Pay for Each User, Nor Every Gigabyte
One of the fundamental beauties of Big Data is that when done right, it can be extremely cost-effective. Putting five petabytes of data into Oracle could break the bank, but you can do just that in a big data system. That said, there are certain price traps you should watch out for before you buy. Some BI applications charge users by the gigabyte, or by the gigabyte indexed. Caveat emptor! It’s totally common to see geometric or even exponential growth in both data and adoption with big data. Our customers have seen deployments grow from tens of billions of entries to hundreds of billions in a matter of months, with a user base up by 50x. That’s another beauty of big data systems: incremental scalability. Make sure you don’t get lowballed into a BI tool that penalizes your upside.
Fourth Commandment: Thou Shalt Covet Thy Neighbor’s Visualizations
Sharing static charts and graphs? We’ve all done it: publishing PDFs, exporting to PNGs, email attachments, etc. But with big data and BI, static won’t cut it: all you have is pretty pictures. You should be able to let anyone you want interact with your data. Think of visualizations as interactive roadmaps for navigating data; why should only one person take the journey? Publishing interactive visualizations is only the first step. Look ahead to the GitHub model. Rather than “Here’s your final published product,” get “Here is a viz: clone it, fork it, see how I arrived at those insights, and see what other problem domains it applies to.” It lets others learn from your insights.
Fifth Commandment: Thou Shalt Analyze Thy Data In Its Natural Form
Too often, I hear people referring to big data as “unstructured.” It’s far more than that. Finance and sensors generate tons of key-value pairs. JSON, probably the trendiest data format of all, can be semi-structured, multi-structured, etc. MongoDB has made a huge bet on keeping data in this format: beyond its virtues for performance and scalability, expressiveness gets lost when you convert it into rows and columns. Lots of big data is still created in tables, too, often with thousands of columns, and you’re going to have to do relational joins over all of it: “Select this from there when that…” Flattening can destroy critical relationships expressed in the original structure. Stay away from BI solutions that tell you “please transform your data into a pretty table because that’s the way we’ve always done it.”
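One way flattening goes wrong can be sketched in a few lines. The order document below is hypothetical, as are its field names and the naive `flatten` helper. It holds two independent nested arrays, and forcing both into a single table pairs every item with every payment, so a simple sum over the flat rows no longer matches the original document.

```python
from itertools import product

# Hypothetical order document with two independent nested arrays.
order = {
    "order_id": 1001,
    "items": [
        {"sku": "A", "price": 30},
        {"sku": "B", "price": 20},
    ],
    "payments": [
        {"method": "card", "amount": 40},
        {"method": "gift", "amount": 10},
    ],
}

def flatten(doc):
    """Naively flatten one document into rows, one per (item, payment) pair."""
    return [
        {
            "order_id": doc["order_id"],
            "sku": item["sku"],
            "price": item["price"],
            "method": pay["method"],
            "amount": pay["amount"],
        }
        for item, pay in product(doc["items"], doc["payments"])
    ]

rows = flatten(order)

# Total computed on the nested structure vs. on the flattened rows.
nested_total = sum(item["price"] for item in order["items"])
flat_total = sum(row["price"] for row in rows)

print("flattened rows:", len(rows))
print("item total in the document:", nested_total)
print("item total in the flat table:", flat_total)
```

The document knows that items and payments are separate lists belonging to one order; the flat table has lost that, and every item price is counted once per payment.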
Sixth Commandment: Thou Shalt Not Wait Endlessly For Thine Results
In 2016 we expect things to be fast. One classic approach is OLAP cubes, which essentially move the data into a pre-computed cache to get good performance. The problem is you have to extract and move data to build the cube before you get performance (see Commandment #1). Now, this can work pretty well at a certain scale … until the temp table becomes gigantic and your laptop crashes trying to materialize it locally. New data will stop analysis in its tracks while you extract that data to rebuild the cache. Be wary of sampling, too: you may end up building a visualization that looks great and performs well before you realize it’s all wrong because you didn’t have the whole picture. Instead, look for BI tools that make it easy to continuously change which data you are looking at.
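The sampling trap is easy to reproduce. The sketch below uses made-up data: a hypothetical event log in which a handful of rows are flagged as fraud, and a deliberately simple every-100th-row sample chosen so the result is deterministic. A random sample can miss rare rows in much the same way, just less predictably.

```python
# Hypothetical event log: 100,000 rows, of which a rare few are flagged as fraud.
TOTAL_ROWS = 100_000

def is_fraud(i):
    """Made-up rule that marks 10 rows out of 100,000 as fraud."""
    return i % 10_000 == 7

events = [{"id": i, "fraud": is_fraud(i)} for i in range(TOTAL_ROWS)]

# Systematic 1% sample: keep every 100th row.
sample = events[::100]

full_fraud = sum(e["fraud"] for e in events)
sample_fraud = sum(e["fraud"] for e in sample)

print("rows in full data:", len(events), "fraud rows:", full_fraud)
print("rows in sample:   ", len(sample), "fraud rows:", sample_fraud)
```

A dashboard built on the sample would be fast and would show no fraud at all, while the full data contains every one of the flagged rows.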
Seventh Commandment: Thou Shalt Not Build Reports, But Apps Instead
For too long, ‘getting the data’ meant getting a report. In big data, BI users want asynchronous data from multiple sources so they don’t need to refresh anything, just like anything else that runs in browsers and on mobile devices. Users want to interact with the visual elements to get the answers they’re looking for, not just cross-filter the results you already gave them. Frameworks like Rails made it easier to build Web applications. Why not do the same with BI apps? There’s no good reason not to take a similar approach to these apps: APIs, templates, reusability, and so on. It’s time to look at BI through the lens of modern web application development.
Eighth Commandment: Thou Shalt Use Intelligent Tools
BI tools have proven themselves when it comes to recommending visualizations based on data. Now it’s time to do the same for automatic maintenance of models and caching, so your end user doesn’t have to worry about it. At big data scale, it’s almost impossible to live without it. There’s also a wealth of information that can be gleaned from how users interact with the data and visuals, which modern tools should use to take advantage of data network effects. Also, look for tools that have search built in for everything, because I’ve seen customers who literally have thousands of visualizations they’ve built out. You need a way to quickly look for results, and with the web we’ve been trained to search instead of digging through menus.
Ninth Commandment: Thou Shalt Go Beyond The Basics
Today’s big data systems are known for predictive analytical horsepower. Correlation, forecasting, and more all make advanced analytics more accessible than ever to business users. Delivering visualizations that can crank through big data without requiring programming experience empowers analysts and gets beyond a simple fixation on ‘up and to the right.’ To realize its true potential, big data shouldn’t have to rely on everyone becoming an R programmer. Humans are quite good at dealing with visual information; we just have to work harder to deliver it to them that way.
Tenth Commandment: Thou Shalt Not Just Stand There On the Shore of the Data Lake Waiting for a Data Scientist To Do the Work
Whether you approach Big Data as a data lake or an enterprise data hub, Hadoop has changed the speed and cost of data and we’re all helping to create more of it every day. But when it comes to actually using big data for business users, it is too often a write-only system: Data created by the many is only used by the few.
Business users have a ton of questions that can be answered with data in Hadoop. Business Intelligence is about building applications that deliver that data visually, in the context of day-to-day decision making. The bottom line is that everyone in an organization wants to make data-driven decisions. It would be a terrible shame to limit all the questions that big data can answer to those that need a data scientist to tackle them.
About the author: Shant Hovsepian is CTO and co-founder at Arcadia Data. Shant is responsible for long-term innovation and technical direction. Previously, he was with Aster Data (later acquired by Teradata), where he was an early member of the engineering team and worked on numerous features across the stack, including high-performance cluster interconnects, data storage, compression, and distributed query planning.