TeraSort equivalent for Apache Spark!

These days I have been trying to run TeraSort on Spark in order to compare its performance with Hadoop. Unfortunately, the Spark package does not ship with the TeraSort source code, so you would need to write it yourself. But that is not what makes me happy: for a comparison, it is more appropriate to use code that is publicly accessible and used by a large community. That way my comparison would be valid and dependable.

One thing that made me so eager to test TeraSort on Spark was the Databricks report on their success in the sorting benchmark challenge. In one part they mention that they were able to fully saturate a 10 Gbps network link, which is fascinating. I am eager to see whether we can reproduce this at a smaller scale, say on a small cluster.

Now, the problem is how to get valid sorting code for Spark. If you dig into GitHub, you can see there are two versions written by two members of the Databricks community. I am mainly focused on the contribution of Ewan Higgs. It has all three steps: (1) TeraGen, (2) TeraSort, and (3) TeraValidate. The other one combines the first and second steps, which is not an appropriate solution. The question now is where this code is, and how to compile and use it alongside Apache Spark.

The code is hosted on GitHub, on Ewan's page, and there are two versions of it. One is embedded inside a Spark project, so you get everything in one place; the code sits in the examples folder, under the Scala package. The other contains only the TeraSort code, which you can compile and install separately. I first went for the first option, but I had difficulty compiling it because of a Google Guava version incompatibility, so I switched to the second alternative.

Here you can find the source code: https://github.com/ehiggs/spark-terasort

Now download and unzip it under your Spark folder, go inside spark-terasort, and type: "mvn install"

Now you can run TeraGen using the command below, since you first need to generate the data:

./bin/spark-submit --class org.apache.spark.examples.terasort.TeraGen --master spark://{masterIP}:{masterPort} examples/target/spark-examples_2.10-1.3.0.jar 12G hdfs://{namenodeIP}:{namenodePort}/sparktera

Now you can see the generated data in your HDFS. It is time to sort it, so type the command below to sort the generated data:

./bin/spark-submit --class org.apache.spark.examples.terasort.TeraSort --master spark://{masterIP}:{masterPort} examples/target/spark-examples_2.10-1.3.1.jar hdfs://{namenodeIP}:{namenodePort}/sparktera hdfs://{namenodeIP}:{namenodePort}/sparkteraout

Now it is all ready. Looking at the TeraSort code, you can see how simple it is: it just calls the sortByKey function on the RDD, and every other step is handled by Spark's internal core.
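To make that concrete: a TeraSort record is 100 bytes, a 10-byte key followed by a 90-byte value, and sortByKey only needs an Ordering on the key type. Here is a minimal plain-Scala sketch (the object and method names are mine, for illustration only) of the unsigned, byte-wise key ordering such a sort relies on:

```scala
// TeraSort records are 100 bytes: a 10-byte key followed by a 90-byte value.
// Spark's sortByKey needs an implicit Ordering on the key type; for raw
// byte-array keys, that is an unsigned lexicographic byte comparison.
object TeraKey {
  val KeyLen = 10

  // Compare two 10-byte keys byte by byte, treating each byte as unsigned.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < KeyLen) {
      val cmp = (a(i) & 0xff) - (b(i) & 0xff) // unsigned byte compare
      if (cmp != 0) return cmp
      i += 1
    }
    0
  }

  implicit val keyOrdering: Ordering[Array[Byte]] =
    new Ordering[Array[Byte]] {
      def compare(x: Array[Byte], y: Array[Byte]): Int = compareKeys(x, y)
    }
}

// With an ordering like this in scope, the whole sort is one call on the RDD:
//   val sorted = records.sortByKey()  // records: RDD[(Array[Byte], Array[Byte])]
```

The unsigned comparison matters: a signed byte compare would order keys containing bytes above 0x7f incorrectly, producing output that TeraValidate would reject.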

There are some issues that would remain here:

  1. The Databricks community mentions a technique called sort-based shuffle. I still do not understand exactly how it works and why it improves performance; checking this out is on my to-do list.
  2. Another interesting thing is that they separated the network module in order to bypass GC pauses. This is what promises to saturate the network, and I am really eager to see how.

I will add more technical information to this article in the near future.

