Apache Spark is an in-memory data processing engine that can run on top of YARN, is seen as a much faster alternative to MapReduce in Hive (with certain claims hitting the 100x mark), and is designed to work with varying data sources, both unstructured and structured.
Apache Tez has in-memory processing capabilities, runs on top of YARN, is seen as a much faster alternative to MapReduce in Hive (with certain claims hitting the 100x mark), and is designed to work with varying data sources, both unstructured and structured.
At first glance, very similar beasts! What is the difference and why use one over the other?
The Story so far …
Hive was created to handle large amounts of data, primarily via batch data processing. It worked, and it worked well, but the MapReduce mechanism was relatively slow. Within the context of its design goals of being a data warehouse where SQL can be used against stored data, whether already structured or by projecting structure onto unstructured data, it achieved these goals.
Speed of data processing, however, is rarely a bad thing. To improve the execution times of jobs, Hive supports three execution engines: the traditional MapReduce, and now Tez or Spark.
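As a rough illustration, the engine is chosen through the Hive property hive.execution.engine, which accepts mr, tez, or spark. The Python helper below is only a sketch that builds the SET statement. The name engine_statement is made up for this example, and how the statement reaches Hive (Beeline, a JDBC client, and so on) is left open.

```python
# Values accepted by Hive's hive.execution.engine property.
VALID_ENGINES = ("mr", "tez", "spark")


def engine_statement(engine: str) -> str:
    """Build the HiveQL statement that selects an execution engine.

    engine_statement is a hypothetical helper name used for illustration.
    """
    normalised = engine.strip().lower()
    if normalised not in VALID_ENGINES:
        raise ValueError(
            f"unknown engine {engine!r}, expected one of {VALID_ENGINES}"
        )
    return f"SET hive.execution.engine={normalised};"


# Print the statement for each supported engine.
for name in VALID_ENGINES:
    print(engine_statement(name))
```

Running the resulting statement at the start of a Hive session switches the engine for the queries that follow in that session.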
Apache Tez, an Apache project initially developed and championed by Hortonworks, was designed to be a highly customisable distributed framework, to plug easily into existing Hive workflows, and to be faster than MapReduce. One of the main facilitators of this was the use of a more flexible DAG (directed acyclic graph) to enable streamlined processing. In standard Hive on MapReduce, it was very common for complex workloads to involve chains of MapReduce jobs, each working progressively through part of the workload. With Tez, any chain of MapReduce jobs expressing one SQL query can be expressed as one DAG.
The use of one DAG and a single job enables Tez to apply various techniques (including in-memory calculations, where possible, as opposed to writing to storage after each step) to remove the processing bottlenecks of chained MapReduce jobs. Reported improvements in speed are 10 times or more, with some claims of similar speeds to Spark. In many cases, batch jobs that might have taken 20 hours can be measured in minutes with Tez.
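To make the difference concrete, here is a toy Python model. It is not Tez or MapReduce code. It simply counts how often results are written out when each stage is a separate job, versus when the stages are run as one pipeline that hands intermediate results on in memory. The function names and the three stages are invented for the illustration.

```python
from typing import Callable, List, Tuple

Stage = Callable[[List[int]], List[int]]


def run_chained(stages: List[Stage], data: List[int]) -> Tuple[List[int], int]:
    """Toy model of chained jobs: every stage writes its output to storage."""
    writes = 0
    for stage in stages:
        data = stage(data)
        writes += 1  # each job materialises its full output
    return data, writes


def run_dag(stages: List[Stage], data: List[int]) -> Tuple[List[int], int]:
    """Toy model of a single DAG: intermediates stay in memory."""
    for stage in stages:
        data = stage(data)
    return data, 1  # only the final result is written


stages: List[Stage] = [
    lambda rows: [r for r in rows if r % 2 == 0],  # filter
    lambda rows: [r * 10 for r in rows],           # transform
    lambda rows: [sum(rows)],                      # aggregate
]

chained_result, chained_writes = run_chained(stages, list(range(10)))
dag_result, dag_writes = run_dag(stages, list(range(10)))
print(chained_result, chained_writes)
print(dag_result, dag_writes)
```

Both runs produce the same answer. The only thing that changes is the number of simulated writes, which is the overhead a single DAG is meant to avoid.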
One of the main design goals of Tez was maximum compatibility with existing solutions that are based upon MapReduce. To this end, the API calls are very similar, and solutions such as Hive and Pig that can generate MapReduce code can also generate Tez jobs.
One of the big advantages of Tez on Hive is that by simply selecting it as the execution engine, you can speed up existing workloads, as it has full backwards compatibility for older MapReduce jobs. This has seen it become a very widely adopted execution engine in production environments.
Compared to Spark, Tez can currently only run on YARN. On the other hand, you don’t need to install Tez on a cluster: you just need to put the Tez jars into an HDFS directory and point your configuration at it. It is also fault tolerant thanks to its YARN base: if you lose nodes, your code will still run across the other nodes, just more slowly. For its part, Spark can run standalone, which might be an advantage for Spark if you have no need for Hive, but might be an advantage for Tez in certain cases, as a standalone Spark approach means doubling up on memory and running separate infrastructure. Take your pick.
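For reference, the "pointing" is normally done with the tez.lib.uris property in tez-site.xml. The Python sketch below only renders that configuration entry. The helper name tez_lib_property and the HDFS path are placeholders for illustration, so substitute wherever you actually uploaded the Tez archive.

```python
import xml.etree.ElementTree as ET


def tez_lib_property(hdfs_path: str) -> str:
    """Render a Hadoop-style <property> entry for tez.lib.uris."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "tez.lib.uris"
    ET.SubElement(prop, "value").text = hdfs_path
    return ET.tostring(prop, encoding="unicode")


# Hypothetical HDFS location of the uploaded Tez archive.
print(tez_lib_property("hdfs:///apps/tez/tez.tar.gz"))
```

The printed element goes inside the configuration block of tez-site.xml on the machine that submits the jobs.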
Apache Spark is an Apache project that provides a distributed in-memory processing engine that is built for speed and is, like Tez, much faster than standard MapReduce.
It provides interactive shell capabilities with R, Scala, or Python, and APIs for app development in other languages such as Java. R integration, which opens up an existing world of statistical libraries for graphing, maths, etc., is huge for data scientists. The usability and accessibility of Spark for BI or application development use cases are a definite strength.
For integration with Hive, you can readily open a Hive context and then execute Spark SQL against the data, giving full compatibility with existing Hive data. External applications can connect to Spark via APIs and industry standards such as JDBC or ODBC. This means that Spark running on YARN alongside Hive can add in-memory speed to existing data lakes. Or you can pull Spark away and use it standalone.
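A minimal PySpark sketch of that integration might look like the following. It assumes a Spark 2.x or later installation, where SparkSession with enableHiveSupport() takes the place of the older HiveContext, and a Spark build configured to reach your Hive metastore. The helper names and the table name in the usage note are hypothetical.

```python
def top_rows_query(table: str, limit: int = 10) -> str:
    """Build a simple Spark SQL query against an existing Hive table."""
    # Crude sanity check on the table name for this illustration only.
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"suspicious table name: {table!r}")
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


def run_on_hive(query: str):
    """Run a query through a Hive-enabled SparkSession and return the rows."""
    # Imported lazily so the query helper works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-integration-sketch")  # arbitrary application name
        .enableHiveSupport()  # lets Spark SQL see tables in the Hive metastore
        .getOrCreate()
    )
    return spark.sql(query).collect()
```

With a reachable metastore, something like run_on_hive(top_rows_query("sales.orders", 5)) would return the first five rows of a table called sales.orders, if such a table existed.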
It is important to note the in-memory aspect of Spark. This is a key factor in its speed (memory operations being faster than disk operations), but it also comes with a cost: to process very large data sets, you need lots of memory. Spark optimises memory use over the cluster for jobs, and RDDs can be stored in memory, on disk, or both, for extra configuration options, but to get the most bang you need to pay up the memory bucks.
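In PySpark, the memory, disk, or both choice maps onto the StorageLevel constants MEMORY_ONLY, DISK_ONLY, and MEMORY_AND_DISK. The sketch below shows one way to pick between them. The helper names are invented for the example, and it assumes you already have an RDD from a running SparkContext.

```python
def storage_level_name(in_memory: bool, on_disk: bool) -> str:
    """Map the memory/disk choice to the name of a pyspark.StorageLevel constant."""
    if in_memory and on_disk:
        return "MEMORY_AND_DISK"
    if in_memory:
        return "MEMORY_ONLY"
    if on_disk:
        return "DISK_ONLY"
    raise ValueError("an RDD must be kept in memory, on disk, or both")


def persist_rdd(rdd, in_memory: bool = True, on_disk: bool = False):
    """Persist an existing RDD with the chosen storage level."""
    # Imported lazily so the mapping helper works without PySpark installed.
    from pyspark import StorageLevel

    level = getattr(StorageLevel, storage_level_name(in_memory, on_disk))
    return rdd.persist(level)
```

Allowing spill to disk trades some speed for the ability to hold data sets that do not fit in the cluster's memory.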
Spark also offers Spark Streaming, enabling high-throughput, fault-tolerant processing of live data streams. It also includes MLlib for machine learning and GraphX for graph-parallel algorithms. These functions are not offered by Tez, so if you require them, Spark is the way to go. (Having said that, streaming solutions are very much à la mode at this time: see Kafka Streams, or Apache Flink, which has offered an experimental mode for running on Tez; so if streaming is your main requirement, that is a different analysis in itself.)
Which to use?
Traditionally, Hive was designed for high-latency batch processing of data and ETL-like tasks.
With Tez, low latency is now up for grabs too, and Spark steps in with a lot of advanced BI and development advantages that further muddle the question.
Each use case is different. On a pure speed basis, there are arguments both ways, and existing infrastructure and “legacy” MapReduce can sway the decision either way too. There is, sadly, no one-size-fits-all solution, so the best thing is to run your own benchmarks, examine your requirements, and choose the best fit for your needs.
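If you want a starting point for those benchmarks, a very small timing harness in Python could look like this. It is a sketch under obvious assumptions: the benchmark function is invented for the example, and the two sleep-based jobs are stand-ins with no relation to real engine performance. In practice each callable would submit the same query to a different engine.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(jobs: Dict[str, Callable[[], object]], runs: int = 3) -> Dict[str, float]:
    """Return the median wall-clock seconds for each named job."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    results: Dict[str, float] = {}
    for name, job in jobs.items():
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            job()
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results


# Placeholder workloads only; replace with calls that run your real query.
jobs = {
    "engine_a": lambda: time.sleep(0.03),
    "engine_b": lambda: time.sleep(0.01),
}
timings = benchmark(jobs)
print(timings)
```

Using the median over several runs dampens the effect of a single slow run, for example one that pays for container start-up or a cold cache.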