Input data
I have two tables exported from MySQL as CSV files.

Table 1: 250 MB on disk, 0.7 million records
Table 2: 350 MB on disk, 0.6 million records
Update: the code
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// Note: hyphens are not valid in Scala identifiers or in SQL table names,
// so use camelCase variables and underscored table names instead.
val tableOne = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("example-input-files/table-1-data.csv")
tableOne.registerTempTable("table_one")
val tableTwo = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("example-input-files/table-2-data.csv")
tableTwo.registerTempTable("table_two")
sqlContext.cacheTable("table_one")
sqlContext.cacheTable("table_two")
val result = sqlContext.sql("SELECT table_one.ID, table_two.ID FROM table_one LEFT JOIN table_two ON table_one.ID = table_two.ID")
result.take(2).foreach(println)
The Spark Job
Read the two CSV files using the Databricks CSV library and register them as tables.
Perform a left join on a common column, a typical left join in relational-database terms.
Print only the top two results, since printing everything to the console would itself consume time.
This takes about 30 seconds in total. I am running on a single machine with enough memory for both files to fit in (they are only 600 MB after all).
I ran the job in two ways: without caching the tables, and after caching both of them with

sqlContext.cacheTable("the_table")

After caching, I found that the join operation itself took 8 seconds to complete.
Is this time reasonable? I am guessing it is not, and that there are a lot of optimisations that could be done to speed up the query.
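One way to see where the time goes is to print the physical plan Spark chooses for the query. A minimal sketch, assuming the same sqlContext as above and the tables registered under hyphen-free names such as table_one and table_two (hyphens are not valid in SQL identifiers):

```scala
// explain() prints the physical plan; an Exchange step in the plan
// indicates a shuffle, which is usually the expensive part of a join.
val result = sqlContext.sql(
  "SELECT table_one.ID, table_two.ID " +
  "FROM table_one LEFT JOIN table_two ON table_one.ID = table_two.ID")
result.explain()
```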
Optimizations that I see
Is there any other way to do this better?
As mentioned by the commenters, Spark is designed for distributed computing. The overhead of initialization and scheduling alone is enough to make Spark seem slow compared to other tools when working locally on small(ish) data.
Running on a cluster will not, I am guessing, make this faster, since the data fits into memory on a single machine and sequential processing will be quicker.
That is not exactly correct: the executors actually work on their local copies of the data in memory for as long as your code performs only narrow transformations. Your code performs a join, however, which is a wide transformation, meaning the blocks will have to be shuffled across the network. Keep this in mind: wide transformations are expensive, so push them as late in the DAG as possible. But again, your data is small enough that you might not see the benefits.
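Because both tables here are small, one concrete way to avoid the shuffle is a broadcast (map-side) join. A sketch, assuming Spark 1.5+ (where org.apache.spark.sql.functions.broadcast is available) and the DataFrames loaded into variables named tableOne and tableTwo:

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcasting the smaller table ships a full copy of it to every
// executor, so the join can run map-side without shuffling either input.
val joined = tableOne.join(
  broadcast(tableTwo),
  tableOne("ID") === tableTwo("ID"),
  "left_outer")
joined.take(2).foreach(println)
```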
Another thing: if you have Hive available, you could consider storing the data in tables partitioned on your join column.
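A hedged sketch of that idea, assuming a HiveContext and a hypothetical low-cardinality column named region as the partition key (partitioning on a near-unique ID column would create far too many small partitions to be useful):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Write each table into Hive, partitioned on the (assumed) join column.
// At read time Spark can prune partitions and only load matching keys.
tableOne.write.partitionBy("region").saveAsTable("table_one_part")
tableTwo.write.partitionBy("region").saveAsTable("table_two_part")
```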