Step aside, MapReduce. You have had a good run, but today’s big data developers are hungry for speed and simplicity. So, when it comes to picking a processing framework for new workloads to run on their Hadoop environments, they are increasingly favouring a nimble young rival called Spark.
At least that’s the message from big data suppliers who are throwing their weight behind Apache Spark, casting it as big data’s next big thing.
At the recent Spark Summit in San Francisco in June, Cloudera chief strategy officer Mike Olson spoke of the “breathtaking” growth of Spark and the profound shift in customer preference that he says his company, a Hadoop distributor, is witnessing as a result.
“Before very long, we expect that Spark will be the dominant general-purpose processing framework for Hadoop,” he said. “If you want a good, general-purpose engine these days, you’re choosing Apache Spark, not Apache MapReduce.”
Olson’s words were chosen carefully, in particular his use of the phrase “general purpose”. His point was that, while there is still plenty of room for special-purpose processing engines for Hadoop, such as Apache Solr for search or Cloudera Impala for SQL queries, the battle for supremacy among processing frameworks that developers can use to create a wide variety of analytic workloads (hence “general purpose”) is now a two-horse race – and it’s one that Spark is winning.
Quite simply, Spark niftily addresses a number of longstanding criticisms that developers have levelled at MapReduce – in particular, its high-latency, batch-mode response.
“It has been known for a very long time that MapReduce was a good workhorse for the world that Hadoop grew up in,” says Arun Murthy, founder and architect at Hortonworks.
He points out that the technology was created in the labs at Google to tackle a very specific use case: web search. More than a decade on, it has evolved – but perhaps not enough to match the enterprise appetite for big data applications.
“Its strength was that it was malleable enough to take on more use cases,” Murthy adds. “But it’s been known forever that there are use cases that MapReduce can solve, sure, but not in the most optimum manner. Just as MapReduce disrupted other technologies, it’s entirely natural that new technologies come along to disrupt or displace MapReduce.”
Speed and simplicity
So what’s so great about Spark, anyway? The main advantage it offers developers is speed. Spark applications are an order of magnitude faster than those based on MapReduce – as much as 100-fold, according to co-creator Matei Zaharia, now CTO at Databricks, a company that offers Spark in the cloud, running not on Hadoop, but on the Cassandra database.
It is important to note that Spark can run on a variety of file systems and databases, among them the Hadoop Distributed File System (HDFS).
What gives Spark the edge over MapReduce is that it handles most of its operations ‘in memory’, copying data sets from distributed physical storage into far faster RAM. By contrast, MapReduce writes to and reads from hard drives. Reading 1MB of data from disk takes milliseconds, whereas reading the same data from memory takes less than a millisecond. In other words, Spark can give organisations a major time-to-insight advantage.
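To make the difference concrete, here is a toy sketch in plain Python. It is not Spark or Hadoop code, and the helper names and file layout are invented for illustration. It runs the same multi-pass calculation twice: once going back to a file on every pass, and once loading the data into memory a single time and reusing it. Rather than timing anything, it simply counts trips to disk.

```python
import os
import tempfile


def write_dataset(path, n):
    """Write the integers 0..n-1 to a text file, one per line."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def load(path, counter):
    """Read the whole file from disk and record that a disk read happened."""
    counter["disk_reads"] += 1
    with open(path) as f:
        return [int(line) for line in f]


def iterate_from_disk(path, passes):
    """Disk-bound style: every pass goes back to storage for its input."""
    counter = {"disk_reads": 0}
    total = 0
    for p in range(passes):
        data = load(path, counter)
        total += sum(x * (p + 1) for x in data)
    return total, counter["disk_reads"]


def iterate_in_memory(path, passes):
    """In-memory style: load once, then reuse the cached list on every pass."""
    counter = {"disk_reads": 0}
    cached = load(path, counter)
    total = 0
    for p in range(passes):
        total += sum(x * (p + 1) for x in cached)
    return total, counter["disk_reads"]


def demo(n=1000, passes=5):
    """Run both variants on the same temporary file and return their results."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        write_dataset(path, n)
        return iterate_from_disk(path, passes), iterate_in_memory(path, passes)


if __name__ == "__main__":
    (disk_total, disk_reads), (mem_total, mem_reads) = demo()
    print(f"from disk: total={disk_total}, disk reads={disk_reads}")
    print(f"in memory: total={mem_total}, disk reads={mem_reads}")
```

Both variants produce the same answer; the in-memory one simply makes one trip to storage instead of one per pass. The more passes a workload makes over the same data, the more that saving matters.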
Gartner analyst Nick Heudecker says: “One client I recently spoke to, with a very large Hadoop cluster, did a Spark pilot in which it was able to take a job from four hours [using MapReduce] to 90 seconds [using Spark].”
For many organisations, that kind of improvement is highly attractive, says Heudecker. “It means they can move from running two analyses a day on a given dataset to as many analyses as they like.”
At the Spark Summit in June, Brian Kursar, director of data science at Toyota Motor Sales USA, described the improvement his team had seen in running its customer experience analysis application. This is used to process about 700 million records taken from social media, survey data and call centre operations, in order to spot customer churn issues and identify areas of concern, so that employees can intervene where necessary.
Using MapReduce, the analysis took 160 hours to run. That’s almost seven days, Kursar pointed out to delegates. “By that point, [that insight] is a little too late,” he said. The same processing job, rewritten for Spark, was completed in just four hours.
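For a sense of scale, the two sets of run times quoted in this article can be turned into speed-up factors with a few lines of Python. The figures are those reported above; the function name is only illustrative.

```python
HOUR = 3600  # seconds in an hour


def speedup(before_seconds, after_seconds):
    """Return how many times faster the second run is than the first."""
    return before_seconds / after_seconds


# Gartner client pilot: four hours with MapReduce, 90 seconds with Spark.
pilot = speedup(4 * HOUR, 90)

# Toyota customer experience analysis: 160 hours down to four hours.
toyota = speedup(160 * HOUR, 4 * HOUR)

print(f"Gartner client pilot: {pilot:.0f}x faster")   # 160x
print(f"Toyota analysis: {toyota:.0f}x faster")       # 40x
```

Both results are well beyond a single order of magnitude, although each is one reported case rather than a benchmark.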
Other big advantages that Spark can offer over MapReduce are its relative ease of use and its flexibility. That is hardly surprising, since Matei Zaharia created Spark for his PhD at the University of California, Berkeley, in response to the limitations he had seen in MapReduce while working in summer internships at early Hadoop users, including Facebook.
“What I saw at these organisations was that users wanted to do a lot more with big data than MapReduce could support,” he says. “It had a lot of limitations – it couldn’t do interactive queries and it couldn’t handle advanced algorithms, such as machine learning. These things were a frustration, so my goal was to address them and, at the same time, I wanted to make it easier for users to adopt big data and start getting value from it.”
Most users agree that Spark is more developer-friendly, including Toyota’s Kursar, who said: “The API was significantly easier to use than MapReduce.”
A recent blog post by Cloudera’s head of developer relations, Justin Kestelyn, claims that Spark’s “rich, expressive, identical” APIs for Scala, Java and Python can reduce code volume by a factor of two to five compared with MapReduce.
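The difference in style can be sketched with a word count, the customary first example for both frameworks. The code below is plain Python and uses neither framework: the three phase functions imitate the shape of a hand-written MapReduce job, while `MiniRDD` is a hypothetical stand-in, written for this illustration, that imitates the chained style of Spark's collection API. It is not intended to reproduce the two-to-five-fold figure, only to show why chained transformations tend to read more compactly.

```python
from collections import defaultdict

LINES = ["spark is fast", "mapreduce is batch", "spark is simple"]


# --- MapReduce style: three explicit phases, each written by hand ---
def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count_mapreduce(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# --- Chained style: transformations applied to one collection object ---
class MiniRDD:
    """Hypothetical in-memory stand-in for a distributed collection."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        """Apply fn to each item and flatten the resulting sequences."""
        return MiniRDD(out for item in self.items for out in fn(item))

    def map(self, fn):
        """Apply fn to each item."""
        return MiniRDD(fn(item) for item in self.items)

    def reduce_by_key(self, fn):
        """Combine the values of (key, value) pairs that share a key."""
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return MiniRDD(acc.items())

    def collect(self):
        """Return the items as a plain list."""
        return self.items


def word_count_chained(lines):
    return dict(
        MiniRDD(lines)
        .flat_map(str.split)
        .map(lambda word: (word, 1))
        .reduce_by_key(lambda a, b: a + b)
        .collect()
    )


if __name__ == "__main__":
    print(word_count_mapreduce(LINES))
    print(word_count_chained(LINES))
```

Once a collection type like this exists, the job itself is the four-line chain in `word_count_chained`; the MapReduce-style version needs a separate function for each phase.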
But this ease of use does not mean flexibility is sacrificed, as Forrester analyst Mike Gualtieri pointed out in a report published earlier this year. On the contrary, he wrote, Spark includes specialised tools that can be used separately or together to build applications.
These include Spark SQL, for analytical queries on structured, relational data; Spark Streaming, for data stream processing in near real time by using frequent ‘micro-batches’; MLlib for machine learning; and GraphX for representing, as a graph, data that is connected in arbitrary ways, for example networks of social media users.
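The ‘micro-batch’ idea is simple enough to sketch in plain Python. The code below is an illustration of the general technique, not Spark Streaming's API: the function name, the event format and the one-second interval are all assumptions made for the example. It slices a time-ordered stream of events into fixed windows and hands each window to an ordinary batch computation.

```python
from collections import Counter

# Hypothetical (timestamp in milliseconds, event type) pairs, sorted by time.
EVENTS = [
    (100, "click"),
    (400, "view"),
    (1200, "click"),
    (1900, "click"),
    (3500, "view"),
]


def micro_batches(events, interval):
    """Group time-ordered (timestamp, value) events into fixed-length windows.

    Yields one list of values per window, including an empty list for any
    window in which nothing arrived.
    """
    batch, window_end = [], None
    for ts, value in events:
        if window_end is None:
            # Align the first window to a multiple of the interval.
            window_end = (ts // interval + 1) * interval
        while ts >= window_end:
            # The event belongs to a later window: close the current one.
            yield batch
            batch = []
            window_end += interval
        batch.append(value)
    if batch:
        yield batch


# Each micro-batch is processed with ordinary batch logic, here a count.
per_batch_counts = [Counter(batch) for batch in micro_batches(EVENTS, 1000)]

if __name__ == "__main__":
    for i, counts in enumerate(per_batch_counts):
        print(f"batch {i}: {dict(counts)}")
```

The shorter the interval, the closer the results get to real time, at the cost of running the batch logic more often.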
Early days
However, a significant hurdle for Spark is its relative immaturity. At financial services company Northern Trust, chief architect Len Hardy’s team are confident users of the Cloudera Hadoop distribution, employing a wide range of tools on top of their implementation including Hive (for data warehousing), Flume (for large-scale log aggregation) and Cloudera Impala (for running SQL queries).
But, right now, Hardy is holding back on using Spark in a production environment. “We’re staying away from Spark for now,” he says. “It’s a maturity issue. The technology has great promise and we will be using it, there’s no doubt about it – and we’re already using it in some proof of concepts.
“But it hasn’t been out all that long, so for our enterprise data platform, where we’re delivering data to partners and clients and they’re making business decisions on it, we need tools to be rock solid and I just don’t feel Spark is at that point yet.”
That caution is not unwarranted. All the major Hadoop suppliers are, naturally, scrambling to bolster their enterprise support for Spark, but as Gartner’s Heudecker points out: “Commercial support for Spark is almost always bundled with other data management products, but information managers and business analytics professionals must be aware that Spark’s development pace makes it challenging for bundling suppliers to constantly support the latest component versions.”
APIs and best practices are still very much a work in progress, Heudecker adds, and suppliers may struggle to support all the available components in the Spark framework equally. Enterprise users should take great care not to deploy mission-critical applications on unsupported or partially supported features.
Cloudera’s Olson acknowledges that Spark is still a young technology. “It’s still early doors – there’s still a lot of work needed on security requirements, for example,” he says.
But, several months after the Spark Summit, he is sticking to his message that, at some point in the not-so-distant future, most new analytic applications for Hadoop will be built on Spark, not MapReduce.
“The dominant share of cycles in the average Hadoop cluster will be on Spark – and that tipping point will come sooner rather than later,” Olson says. “Now, I can’t make a prediction of exactly when that will be, but I will say that some of our customers, especially in financial services and consumer goods, have already hit that tipping point. Many others are bound to follow.”