Features

Big Data Outliers: Friend or Foe?

Sep 2, 2014

The bigger your dataset, the greater your chance of stumbling into an outlier. It’s practically a certainty you’ll find isolated and possibly bizarre values you never expected to see in your data. But how you respond to these outliers could mean the difference between big data success and failure. How should you deal with data outliers? The answer is simple: It depends. On the one hand, the presence of outliers may be a sign of serious data quality issues, Read more…

Five Steps to Running ETL on Hadoop for Web Companies

Sep 1, 2014

Mention ETL (Extract, Transform and Load) and eyes glaze over. The thought goes: “That stuff is old and meant for clunky enterprise data warehouses. What does it have to do with my Internet/Web/ecommerce application?” Quite a lot, actually. ETL did originate in enterprise IT where data from online databases is Extracted, then Transformed to normalize it and finally Loaded into enterprise data warehouses for analysis. Although Internet companies feel they have no use for expensive, proprietary data warehouses, the fact Read more…

Coping with Big Data at Experian – “Don’t Wait, Don’t Stop”

Sep 1, 2014

Experian is no stranger to Big Data. The company can trace its origins back to 1803 when a group of London merchants began swapping information on customers who had failed to meet their debts. Fast forward 211 years. The rapid growth of the credit reference industry and the market for credit risk management services set the stage for the reliance on increasing amounts of consumer and business data that has culminated in an explosion of Big Data. Data that is Read more…

Hadoop Labor Update: Cloudera Talks Impala 2.0 as Hortonworks Previews Kafka

Aug 29, 2014

Say what you will about Hadoop (and we do), the big data platform is evolving at an incredible rate. This week, two of the biggest Hadoop distributors, Hortonworks and Cloudera, shared how they’re working to improve two key aspects of the platform: real-time data pipelining via Apache Kafka and SQL-based data warehousing via Impala. Let’s start with Cloudera. This week, the Hadoop distributor announced that the upcoming release of Impala 2.0 will add much more complete SQL functionality to CDH, Read more…

What Exactly Is Big Data, If It’s Neither About Big Nor Data?

Aug 29, 2014

Many people ask us why there doesn’t seem to be an accepted definition for Big Data in view of the massive press it receives. And given the amount of marketing effort expended by the global IT vendors (and the venture-funded start-ups) that are targeting this bandwagon, those people can be forgiven for being confused. Just as most consumer marketing – curiously – is focused on adolescent buyers with a fraction of the purchasing power of their parents and grandparents, so it Read more…

Who IBM’s Server Group Turns To for Machine Data Analytics

Aug 28, 2014

IBM’s engineering prowess is second to none, and its Systems and Technology Group builds the computers that run the world’s biggest companies. But when IBM’s STG unit went looking for a way to predict failures by analyzing log data returned by its customers’ servers and storage arrays, it looked externally to a little-known machine data analytics startup from Santa Clara. Glassbeam got its start five years ago, before the Internet of Things (IoT) became the industry’s hottest buzzword and out-hyped Read more…

Are Meetings a Waste of Time? Data Analytics Weighs In

Aug 27, 2014

Have you ever wondered whether big meetings are a giant waste of everybody’s time? Now a data analytics startup named VoloMetrix is putting data science to that age-old question, and the answers it’s generating might surprise you. In most industries, the cost of human capital is the most expensive line item on the annual budget (the oil and gas business is one exception). The salaries and benefits that companies pay their workers routinely account for 50 to 70 percent of Read more…

Why Hadoop Isn’t the Big Data Solution You Think It Is

Aug 26, 2014

Hadoop carries a lot of promise in the IT world for the way it has democratized access to massively parallel storage and computational power. But the level of hype that surrounds Hadoop is disproportionate to its present capabilities, raising the possibility of a big data letdown of elephantine proportions. The emergence of Hadoop as a next-generation platform for parallel computing has piqued the interest of customers and investors alike. What mid-sized company looking for a big data edge wouldn’t want Read more…

How to Move 80PB Without Downtime

Aug 25, 2014

When the online photo company Shutterfly decided to move its entire data center recently, the possibility of downtime was a big issue. After all, the company had 80 petabytes of customer data spread across tens of thousands of spinning disks, and those disks wouldn’t be spinning while being physically moved. Months later, after the last delivery made its way to Shutterfly’s new data center, not one piece of data was lost or even temporarily unavailable from the company’s website. How Read more…

The Evolution of the Data Scientist

Aug 25, 2014

The role of the data scientist is currently one of the most in-demand jobs in the tech industry. As more businesses turn to big data analytics for insights into their customers and trends in their industries, and to gain a competitive edge, this role is constantly evolving and moving from obscure to mainstream. The term was coined by Jeff Hammerbacher and DJ Patil in Silicon Valley in 2008, and the data scientist is facing new challenges as the information this individual works with is growing Read more…