Many small and medium sized businesses would like to get in on the big data game but do not have the resources to implement parallel database management systems. That being the case, which relational database management system would provide small businesses the highest performance?
This question was asked and answered by Marissa Hollingsworth of Boise State University in a graduate case study that compared the performance rates of MySQL, Hadoop MapReduce, and Hive at scales no larger than nine gigabytes.
Hollingsworth also used only relational data, such as payment information, which stands to reason since anything more would require a parallel system. “This experiment,” said Hollingsworth “involved a payment history analysis which considers customer, account, and transaction data for predictive analytics.”
The case study, the full text of which can be found here, concluded that MapReduce would beat out MySQL and Hive for datasets larger than one gigabyte. As Hollingsworth wrote, “The results show that the single server MySQL solution performs best for trial sizes ranging from 200MB to 1GB, but does not scale well beyond that. MapReduce outperforms MySQL on data sets larger than 1GB and Hive outperforms MySQL on sets larger than 2GB.”
Hollingsworth ran her tests on sample data provided to her by what she calls “CompanyX,” a software company near Boise that wished to remain anonymous. Along with being a computer science graduate student at Boise State, Hollingsworth works for HP Indigo as a Software Design Engineer. It is possible that Hollingsworth was able to obtain this data as a result of working for HP Indigo, but that matters little.
CompanyX’s motivation is that they want to perform predictive analytics on their payment information. One benefit of this is that they, like everyone else, have customers are consistently late with their payments. CompanyX automatically hires out a collection agency to follow up on late and delinquent charges, a collection agency that CompanyX has to pay per inquiry. If CompanyX can identify customers who may always pay late but do always pay, they would not have to ask the collection agency to inquire about that customer. It is unclear if those saved expenses would cover the cost of implementing a predictive analytics system, but there are other benefits as well.
The data that CompanyX gave her was not enough, however. Hollingsworth had to expand on it, creating data that was statistically similar to the datasets given to her. While this creation of data may sound scientifically sketchy, Hollingsworth simply needed test data to feed into the management systems. Further, CompanyX gave her the data under the context of having hundreds of customers but expecting to expand their customer base by a factor of ten. As long as the fabricated data was statistically similar as the original data, it would make the company happy while standing up to scientific scrutiny.
The results were compiled on Boise State’s Onyx cluster. The exact runtimes are not important here, since Onyx will perform differently from other clusters. The runtime information is useful only when compared to that of the other systems when run on the exact same system using the exact same computing power.
According to Hollingsworth, MapReduce was the big winner. “From these results, it is evident that MapReduce outperforms MySQL and Hive by a dramatic margin.” What was particularly interesting was that for MapReduce and Hive, the runtimes remained relatively constant from 500 to 20,000 accounts (or 235 MB to 9 GB) while MySQL’s runtimes rose at some non-linear curve whose function is not discernible. Hollingsworth did not offer an explanation for this since her focus was simply on which performed best.
Hive’s average runtime raised only slightly, from 535 seconds for 500 accounts to 583 seconds for 20,000 accounts, a runtime increase of 9% for a data increase of 4000%. MapReduce remained relatively constant as well, going from 81.1 seconds to 88.9 seconds, an increase of 9.6%. Meanwhile, MySQL’s runtime increased dramatically. MySQL actually outperformed MapReduce until about the 2,500 account benchmark, analyzing 500 accounts in only 4.2 seconds and 1000 accounts in 13.8 seconds. MySQL also proves better than Hive until somewhere between 5,000 and 10,000 accounts. The elongation in runtime is remarkable in MySQL, as it grows nearly exponentially until the 10,000 account mark.
So what does this all mean? In a basic sense, it means that Hollingsworth recommended MapReduce to the small software company for their predictive analytics. Indeed, Hollingsworth wrote, “our results indicate that MapReduce is the best candidate for this case study. Therefore, we recommend that CompanyX deploy this type of distributed warehousing solution for their BI predictive analytics.”
In a broader sense, it means smaller companies can get away with running MySQL if they do not expect their customer base to exceed a thousand. However, MySQL’s performance dramatically decreases as more data is dumped into it. MapReduce and Hive, on the other hand, remain relatively constant up to ten gigabytes.
It should be noted that, with the exception of MySQL working with only a few hundred megabytes, none of these results happen in what would be considered real time. Some businesses may demand that these solutions operate in real time. However, the implication here is that these small businesses are in no position to demand that. Besides, 88.9 seconds, or about a minute and a half, is fairly quick, certainly quick enough to have time left over to make a decision on which customers’ information should be sent to collection agencies. Heck, some may even consider 88.9 seconds real time.
Either way, this independent case study outlined the performances of MySQL, MapReduce, and Hive on a relatively small scale for relatively small businesses and declared MapReduce the winner of the resulting small business predictive analytics race.