June 30, 2022

It’s Not ‘Mobile Spark,’ But It’s Close


On April 1, 2015, Apache Spark PMC member Reynold Xin wrote a compelling blog post detailing plans to deliver a mobile version of Spark. It was all a joke, of course: Spark was a heavy bit of code designed for distributed systems (although the Wall Street Journal apparently did bite). But with this week's launch of Spark Connect, the vision of mobile Spark is actually back in play, but with an interesting twist.

Data applications have escaped the data center, and now Spark is about to follow suit with Spark Connect, according to Xin, a co-founder and chief architect of Databricks, which is hosting its first in-person Data + AI Summit in three years this week in San Francisco.

“Spark is often associated with big compute, large clusters, thousands of machines, big applications,” Xin said during his keynote address at the Moscone Center on June 28. “But the reality is data applications don’t just live in data centers anymore. They can be everywhere.”

Data applications can be found in interactive environments, like notebooks and IDEs, Xin said. “They can happen in Web applications,” he said. “They can happen in edge devices,” such as Raspberry Pis and even your iPhone.

While Spark has become a ubiquitous number-cruncher on massive clusters with thousands of nodes, it remains mostly cut off from the data application revolution occurring on the edge. Why? Xin explained that it’s a result of Spark’s makeup.

Reynold Xin is the top contributor to the Apache Spark project and is the chief architect at Databricks

“You zoom in, you realize Spark has a monolithic driver,” Xin said. This monolithic driver runs not only the application code, but the Spark code as well. Bundling the customer’s application code together with Spark’s own components, such as the optimizer and the execution engine, makes it difficult to run Spark on smaller devices. Spark’s Java roots and its hefty appetite for JVM memory also play a role.

But there are potential workarounds. Why not just keep Spark on the server, and serve data to the client via SQL? That could work, Xin said, but something would be lost in the translation. “SQL doesn’t actually capture the full expressiveness of Spark,” he said. “It’s just a much more limited subset.”
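The gap Xin describes can be sketched in plain Python (this is an illustrative toy, not Spark’s actual API): a dataframe-style chain can embed arbitrary host-language logic as a first-class function, while a SQL string can only express what the SQL dialect happens to support.

```python
# Illustrative sketch of why SQL strings are "a much more limited subset":
# the function `clean` is arbitrary host-language code, and a dataframe-style
# pipeline can carry it directly, while a SQL string cannot ship the function.

def clean(name):
    """Arbitrary Python logic the user wants applied to each row."""
    return name.strip().title()

rows = [{"name": "  ada lovelace "}, {"name": "ALAN TURING"}]

# Dataframe-style: the transformation is a real function object in the plan.
cleaned = [dict(r, name=clean(r["name"])) for r in rows]

# SQL-style: the same intent must be squeezed into string syntax; anything
# the dialect's built-ins can't express simply cannot be sent this way.
sql = "SELECT initcap(trim(name)) AS name FROM people"

print(cleaned)  # [{'name': 'Ada Lovelace'}, {'name': 'Alan Turing'}]
```

In real Spark, this is the difference between shipping a SQL string over JDBC and shipping a DataFrame plan that can reference user-defined functions.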

Another possible route could be to piggyback alongside products like Jupyter notebooks, which come with mobile runtimes that connect to backend clusters. But the potential for JVM code conflicts is just too great.

“You run into a whole suite of multi-tenancy operational issues,” Xin said. “The fundamental issue here is a lack of isolation. One application is consuming too much memory and not behaving.”

The Spark community has navigated around those thorny issues with Spark Connect, a new Spark component that enables applications running anywhere to leverage the full power of Spark, Xin said.

Spark Connect introduced a decoupled architecture to Spark development, Xin said. A core component of Spark Connect is a client-server protocol that carries unresolved query plans from the data application running on an edge device to Spark itself running on the server, which serves the data. The protocol, which is based on gRPC and Apache Arrow, can work with any language supported by Spark.

When the server running Spark receives the unresolved query plan, it executes it using the standard query optimization and execution pipeline, and then sends the results back to the data application, Xin said.
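The round trip Xin describes can be mocked up in a few lines of Python. This is a minimal sketch of the idea only: the real Spark Connect protocol encodes plans as protobuf messages over gRPC and streams results back as Apache Arrow batches, but the shape of the exchange is the same — the thin client describes the query as data, and all resolution and execution happen on the server.

```python
# Toy sketch of the Spark Connect round trip (illustrative, not the real
# protocol): client builds an *unresolved* plan, serializes it, server
# resolves table names against real data, executes, and returns rows.
import json

# --- thin client side: describe the query, do no actual work --------------
def build_plan():
    return {"op": "filter", "predicate": {"col": "age", "gt": 30},
            "child": {"op": "scan", "table": "people"}}

def serialize(plan):
    return json.dumps(plan)  # stand-in for the protobuf wire format

# --- server side: the only place data and execution live ------------------
TABLES = {"people": [{"name": "Ann", "age": 41}, {"name": "Bo", "age": 25}]}

def execute(wire):
    plan = json.loads(wire)
    if plan["op"] == "scan":
        return TABLES[plan["table"]]  # name resolution happens server-side
    if plan["op"] == "filter":
        rows = execute(json.dumps(plan["child"]))
        p = plan["predicate"]
        return [r for r in rows if r[p["col"]] > p["gt"]]
    raise ValueError(f"unknown op: {plan['op']}")

results = execute(serialize(build_plan()))
print(results)  # [{'name': 'Ann', 'age': 41}]
```

Because the client never holds the optimizer, execution engine, or data, it can be small enough to run anywhere, which is the crux of the design.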

It works similarly to how SQL strings are sent over the JDBC or ODBC protocols, Xin said, but with one important distinction. “There’s so much more than just sending SQL because you have the full power of the dataframe API in Spark,” he said.

“So with this protocol and with thin clients…now you can actually embed Spark in all of these devices, including ones with very limited computational power,” Xin continued. “Such devices can actually drive and orchestrate all the programs and actually offload the heavy lifting execution over to the cloud in the back.”

This architecture mitigates a lot of the operational and memory issues that would arise if one tried to run the full Spark environment on a mobile device, Xin said. Because each application runs in its own thin client, separate from the Spark server, a misbehaving application is less likely to impact others. The decoupling also simplifies debugging and upgrades, he said.

“Spark Connect…in my mind is the widest change to the project since the project’s inception,” Xin said.

And that’s no joke.

Related Items:

Databricks Scores ACM SIGMOD Awards for Spark and Photon

Databricks Opens Up Its Delta Lakehouse at Data + AI Summit

Apache Spark Is Great, But It’s Not Perfect