Input data
I have two tables exported from MySQL as CSV files.

Table 1: 250 MB on disk, 0.7 million records
Table 2: 350 MB on disk, 0.6 million records
Update: the code
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// Note: hyphens are not valid in Scala identifiers or in SQL table names,
// so use camelCase variables and underscored table names instead.
val tableOne = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("example-input-files/table-1-data.csv")
tableOne.registerTempTable("table_one")
val tableTwo = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("example-input-files/table-2-data.csv")
tableTwo.registerTempTable("table_two")
sqlContext.cacheTable("table_one")
sqlContext.cacheTable("table_two")
val result = sqlContext.sql("SELECT table_one.ID, table_two.ID FROM table_one LEFT JOIN table_two ON table_one.ID = table_two.ID")
result.take(2).foreach(println)
The Spark Job
Read the two CSV files using the Databricks CSV library and register them as tables.
Perform a left join on a common column, a typical left join in relational-database terms.
Print only the top two results, since printing everything to the console would itself consume time.
This takes about 30 seconds in total. I am running on a single machine with enough memory for both files to fit in (they are only 600 MB after all).
I ran the job in two ways: without caching the tables, and after caching both of them with

sqlContext.cacheTable("the_table")

After caching, I found that the join operation itself took 8 seconds to complete.
Is this time reasonable? I am guessing it is not, and that there are a lot of optimisations that could be done to speed up the query.
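One way to see where the time goes is to print the physical plan Spark chooses for the query. A minimal sketch, assuming the same sqlContext as above and the tables registered under hyphen-free names such as table_one and table_two (hyphens are not valid in SQL identifiers):

```scala
// explain() prints the physical plan; an Exchange step in the plan
// indicates a shuffle, which is usually the expensive part of a join.
val result = sqlContext.sql(
  "SELECT table_one.ID, table_two.ID " +
  "FROM table_one LEFT JOIN table_two ON table_one.ID = table_two.ID")
result.explain()
```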
Optimizations that I see
Is there any other way to do this better?
As mentioned by the commenters, Spark is designed for distributed computing. The overhead of initialization and scheduling alone is enough to make Spark seem slow compared to other tools when working locally on small(ish) data.
Running on a cluster will not, I am guessing, make this faster, since the data fits into memory on a single machine and sequential processing will be quicker.
That is not exactly correct: the executors actually work on their local copies of the data in memory for as long as your code performs only narrow transformations. Your code performs a join, however, which is a wide transformation, meaning the blocks will have to be shuffled across the network. Keep this in mind: wide transformations are expensive, so push them as late in the DAG as possible. But again, your data is small enough that you might not see the benefits.
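Because both tables here are small, one concrete way to avoid the shuffle is a broadcast (map-side) join. A sketch, assuming Spark 1.5+ (where org.apache.spark.sql.functions.broadcast is available) and the DataFrames loaded into variables named tableOne and tableTwo:

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcasting the smaller table ships a full copy of it to every
// executor, so the join can run map-side without shuffling either input.
val joined = tableOne.join(
  broadcast(tableTwo),
  tableOne("ID") === tableTwo("ID"),
  "left_outer")
joined.take(2).foreach(println)
```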
Another thing: if you have Hive available, you could consider storing the data in tables partitioned on your join column.
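A hedged sketch of that idea, assuming a HiveContext and a hypothetical low-cardinality column named region as the partition key (partitioning on a near-unique ID column would create far too many small partitions to be useful):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Write each table into Hive, partitioned on the (assumed) join column.
// At read time Spark can prune partitions and only load matching keys.
tableOne.write.partitionBy("region").saveAsTable("table_one_part")
tableTwo.write.partitionBy("region").saveAsTable("table_two_part")
```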