Which Programming Language Is Best for Big Data?
Nothing is quite so personal for programmers as what language they use. Why a data scientist, engineer, or application developer picks one over the other has as much to do with personal preference and their employers’ IT culture as it does with the qualities and characteristics of the language itself. But when it comes to big data, there are some definite patterns that emerge.
The most important factor in choosing a programming language for a big data project is the goal at hand. If the organization is manipulating data, building analytics, and testing out machine learning models, it will probably choose a language that’s best suited for that task. If the organization is looking to operationalize a big data or Internet of Things (IoT) application, there is another set of languages that excel at that.
In the data science exploration and development phase, the most popular language today unquestionably is Python. One big reason for Python’s popularity is the plethora of tools and libraries available to help data scientists explore big data sets. Python was recently ranked the number one language by IEEE Spectrum, where it moved up two spots to beat C, Java, and C++, although Python trails these languages on the TIOBE Index. As a general purpose language, Python is also widely used outside of data science, which only adds to its usefulness.
Another popular data science language is R, which has long been a favorite of mathematicians, statisticians, and researchers in the hard sciences. The SAS environment from the company of the same name continues to be popular among business analysts, while MathWorks’ MATLAB is also widely used for the exploration and discovery phase of big data. You also can’t go far in data science without knowing some SQL, which remains a very useful language.
The choice of data science language may also be determined by what notebook a data scientist is using. Jupyter is the successor to the IPython notebook, and as such is closely aligned with Python, but it also supports R, Scala, and Julia. The Apache Zeppelin notebook includes Python, Scala, and SparkSQL support.
Programmers will often opt for a different set of languages when it comes to developing production analytics and IoT apps. While they may choose Python or R during the experimental phase of the project, programmers will often rewrite the application and re-implement the machine learning algorithms using entirely different languages.
Java continues to be a very popular choice owing to the large number of Java developers in the world, as well as the fact that some popular frameworks, such as Apache Hadoop, were developed in Java. Scala, which runs on the Java Virtual Machine (JVM), is also widely used in data science; Apache Spark was written in Scala, and Apache Flink was written in a combination of Java and Scala.
However, for some production applications, developers still favor lower-level languages that run closer to the iron. When speed and latency matter, many developers turn to C and C++ to get them what they want.
MapR Technologies developed its own big data platform, which contained a Hadoop runtime, a NoSQL database, and real-time streaming. But instead of writing its MapR-FS file system in Java, as HDFS was, it wrote it in C and C++. As MapR’s Senior Staff Software Engineer Smidth Panchamia explained in a MapR blog post, it’s tough to beat C and C++ for some tasks.
“Native languages like C/C++ provide a tighter control on memory and performance characteristics of the application than languages with automatic memory management,” Panchamia writes. “A well written C++ program that has intimate knowledge of the memory access patterns and the architecture of the machine can run several times faster than a Java program that depends on garbage collection. For these reasons, many enterprise developers with massive scalability and performance requirements tend to use C/C++ in their server applications in comparison to Java.”
Bloomberg uses Python for much of the exploratory data science work that goes into services delivered in the Bloomberg Terminal. But when it comes to writing the actual programs that feed data to customers in real time, it turns to C++.
“At the heart, it’s a C++ shop,” Bloomberg’s Head of Data Science Gideon Mann told Datanami last year. “Most of the time, when we’re doing data science, it’s really to build machine learning products. And because we have all of these real time latency constraints, we don’t want to use something like Python or Java, where you’re going to have garbage collection. You need to be a little worried about intermediate lag. By building out everything in C++, you can deploy it and have a fair amount of latency guarantees.”
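The kind of collection pause Mann is worried about can be observed directly, even in Python. The following is a minimal sketch, not anything Bloomberg runs: the `Node` class and the object count are hypothetical, and the measured pause will vary widely by machine and interpreter version. It builds a pile of self-referencing objects that only CPython's cycle collector can reclaim, then times one full collection.

```python
import gc
import time


class Node:
    """Hypothetical object that references itself, forming a reference cycle."""

    def __init__(self):
        self.ref = self  # reference counting alone can never free this object


gc.disable()  # keep automatic collections from running while garbage piles up
for _ in range(200_000):
    Node()  # each instance becomes unreachable cyclic garbage immediately

start = time.perf_counter()
freed = gc.collect()  # run one full collection and block until it finishes
pause_ms = (time.perf_counter() - start) * 1000
gc.enable()

print(f"collected {freed} objects in {pause_ms:.2f} ms")
```

While `gc.collect()` runs, the calling thread does nothing else, which is the sort of unpredictable stall a latency-sensitive service tries to avoid.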
Another C++ aficionado is Dor Laor, CEO of ScyllaDB, which is a drop-in replacement for the Apache Cassandra NoSQL database. While Cassandra was written in Java, ScyllaDB was written in C++.
Laor, who also helped develop the KVM hypervisor, says lower-level languages in general are better for developing system software and databases. He points out that software giant Oracle, which controls Java, opted to write its eponymous database in C, and that IBM’s DB2 was written in a combination of C and C++. “Even Mongo is written in C++,” he said.
By essentially rewriting Cassandra in C++ and avoiding the garbage collection associated with the JVM, ScyllaDB is able to achieve orders-of-magnitude performance gains over Cassandra, Laor claimed.
“If you run Cassandra, then you need to reserve some amount [of memory] for Java,” he tells Datanami. “And you also need to reserve additional amounts for off-heap data structures that are too heavy for Java to handle. And you also need to preserve enough memory for the Linux page cache to cache to disk. Forget about performance; just to tune it, it’s a nightmare.”
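The tuning problem Laor describes is at bottom a budgeting exercise: one machine's RAM has to be divided among the JVM heap, off-heap structures, and the operating system's page cache. Here is a minimal sketch of that arithmetic. The function name, the 1 GB operating system reserve, and every figure in the example are hypothetical, and none of them are Cassandra defaults or recommendations.

```python
def page_cache_budget_gb(total_ram_gb, jvm_heap_gb, off_heap_gb, os_reserve_gb=1.0):
    """Return the RAM (in GB) left for the page cache after the other reservations."""
    remaining = total_ram_gb - jvm_heap_gb - off_heap_gb - os_reserve_gb
    if remaining < 0:
        raise ValueError("reservations exceed total RAM")
    return remaining


# Hypothetical 64 GB node: 16 GB JVM heap, 8 GB off-heap, 1 GB for the OS itself.
print(page_cache_budget_gb(64, 16, 8))  # prints 39.0
```

Every gigabyte moved into one bucket comes out of another, which is why getting the split right by hand can take repeated trial and error.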
ScyllaDB was developed using C++ version 17. “It’s the latest and greatest of C++, the cutting edge,” Laor says. “It allows us to use really fancy language options, but it’s also complex, so there’s a big learning curve…even the time it takes you to compile the database is very long.”
However, there are downsides to developing a database in C++, Laor admits. For starters, the increased complexity of the C++ source code means fewer developers will be able to contribute to the ScyllaDB project, which is open source. Plus, for some developers, letting the JVM handle memory gives them more time to develop better algorithms, which may be a good tradeoff.
The real-time stream analytics platform SQLstream was also developed in C++. “Not only do you get better performance from the code, but even more importantly, it’s the lack of garbage collection,” SQLstream CEO and founder Damian Black told Datanami last year.
Managing the memory itself gives SQLstream a 5x performance boost over Java, Black says. “Not only that, we have lock-free execution, which is not easy to do,” he continued. “It’s a trendy thing but it’s really hard to do. You have to have a true declarative system, which we do have. We don’t transact any of the input streams or data or window objects, unlike almost any of the other streaming platforms.”
Since Apache Hadoop was written in Java, the developers at Hortonworks use Java for many of the sub-projects and other open source products that make up the Hortonworks Data Platform (HDP). They also program in Java for Hortonworks Data Flow (HDF), which is based on the Java-based Apache NiFi. But for IoT apps, NiFi has a secret weapon: C++.
“NiFi has a pretty cool thing called MiNiFi,” Hortonworks co-founder and Chief Product Officer Arun Murthy told Datanami last year. “It’s [a] C++ driver you throw on [a] cellphone or a security camera. So you can collect data from IoT-ish devices, all the way [out on the edge], secured and encrypted, and move it to your enterprise data center.”
Another streaming product based on C++ is the Concord framework that came out of the ad tech world. When YieldMo had trouble getting Apache Storm (developed in Java and a JVM-compliant language called Clojure) to scale, a group of developers at the company, including Shinji Kim, decided to build their own real-time streaming system based on the MillWheel paper from Google.
The resulting Concord product – which was acquired last fall by Akamai Technologies – was written in C++ and implemented on the Mesos resource scheduler. “If you run that on Hadoop MapReduce jobs, if something fails, it definitely can cause a certain behavior, like cascading failure or a cluster-wide failure if one of your jobs doesn’t run well,” Kim told Datanami. “Or there could be an issue with the JVM where if you get high influx of traffic all of a sudden, if a GC [garbage collection] kicks in… there’s a lot of computations that you need [to] get right.”
Before it was acquired by Apple two years ago, Turi (formerly GraphLab and Dato) developed a popular machine learning framework that included graph algorithms. While the framework as a whole was open source and has Python APIs for data scientists to develop in, the underlying machine learning engine, based in C++, remained proprietary. There was good reason for that, as Turi’s Rajat Arya explained.
“Most academic papers and almost all vendors are talking about how long to train a model,” Arya told Datanami. “It turns out you really care about how long it takes to score a model or get a prediction. The real time prediction is what’s important because that’s what’s driving the business.”
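Arya's distinction between training time and scoring time is easy to measure separately. The sketch below is only an illustration: the `train` and `score` functions are a hypothetical toy least-squares model written with the standard library, not Turi's engine, and the timings it prints depend entirely on the machine running it.

```python
import time


def train(xs, ys):
    """Fit y = a*x + b by ordinary least squares (toy stand-in for a real model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def score(model, x):
    """Return the model's prediction for a single input."""
    a, b = model
    return a * x + b


# Synthetic training data that follows y = 2x + 1 exactly.
xs = [float(i) for i in range(100_000)]
ys = [2.0 * x + 1.0 for x in xs]

t0 = time.perf_counter()
model = train(xs, ys)
train_seconds = time.perf_counter() - t0

t0 = time.perf_counter()
prediction = score(model, 10.0)
score_seconds = time.perf_counter() - t0

print(f"train: {train_seconds:.4f} s, score one row: {score_seconds * 1e6:.1f} us")
```

Training is paid once, while scoring is paid on every request, so the second number is the one that customers of a real-time service actually feel.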
By writing the engine in C++, Turi could ensure a certain level of performance. “Open source is a great teaching tool. It gets a lot more people plugged in,” Arya said. “But the ability to get something done in a week is much more important. Open source can’t fill that gap.”