I have a large dataset (say 4 GB) that is used as a reference source to process another large dataset (100-200 GB). I have a cluster of 10 nodes running 30 executors for this. So every executor has its own JVM, right? Each one loads the whole reference dataset every time, which takes a long time and is inefficient. Is there a good approach to handle this? Currently I store the data in AWS S3 and run everything on EMR. Would it be better to use a more elegant storage that I could query on the fly, or to spin up, for example, Redis as part of my cluster, push the data there, and then query it?
UPD1:
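import scala.collection.mutable
import sparkSession.implicits._ // provides the encoder needed by .as[SegmentConflationRef]

// Read the reference CSV into a typed Dataset.
val ds = sparkSession.sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .schema(schema)
  .option("delimiter", ",")
  .load(path)
  .coalesce(3)
  .as[SegmentConflationRef]
// Pull the whole reference set onto the driver.
val data: Seq[SegmentConflationRef] = ds.collect().toSeq
// Group by segment id and sort each group by start offset.
val map = mutable.Map[String, Seq[SegmentConflationRef]]()
data.groupBy(_.source_segment_id).foreach { case (id, refs) =>
  map += (id -> refs.sortBy(_.source_start_offset_m))
}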
So in this case I want the reference map to be copied to every executor. One problem is how to broadcast such a big map across nodes; or what would be a better approach? Perhaps not using Spark for this from the beginning, and loading the data locally from HDFS in every executor?
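For reference, broadcasting a driver-side map in Spark generally looks like the sketch below. It is a minimal illustration, not a tuned solution: the two-field `SegmentConflationRef`, the `Record` row type, the in-memory sample data and the `local[*]` master are all assumptions made so the snippet is self-contained. Only `source_segment_id` and `source_start_offset_m` come from the code above.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, reduced shape: only these two fields appear in the question.
case class SegmentConflationRef(source_segment_id: String, source_start_offset_m: Double)
// Hypothetical row type standing in for the large dataset.
case class Record(segment_id: String, offset_m: Double)

object BroadcastRefSketch {
  def main(args: Array[String]): Unit = {
    // local[*] is only for trying the sketch out; on a cluster the master is set by spark-submit.
    val spark = SparkSession.builder()
      .appName("broadcast-ref-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny in-memory stand-ins for the reference data and the large dataset.
    val refs = Seq(
      SegmentConflationRef("a", 10.0),
      SegmentConflationRef("a", 0.0),
      SegmentConflationRef("b", 5.0))
    val big = Seq(Record("a", 3.0), Record("b", 7.0), Record("c", 1.0)).toDS()

    // Build an immutable lookup on the driver: segment id -> refs sorted by start offset.
    val lookup: Map[String, Seq[SegmentConflationRef]] =
      refs.groupBy(_.source_segment_id).map { case (id, rows) =>
        id -> rows.sortBy(_.source_start_offset_m)
      }

    // Register the lookup as a broadcast variable.
    val lookupBc = spark.sparkContext.broadcast(lookup)

    // Tasks read the executor-local copy through .value.
    val matched = big.map { r =>
      val candidates = lookupBc.value.getOrElse(r.segment_id, Seq.empty)
      (r.segment_id, candidates.size)
    }
    matched.show()

    spark.stop()
  }
}
```

A broadcast variable is cached once per executor JVM and shared by that executor's tasks, so with 30 executors there would still be 30 copies of the map. Whether a structure built from 4 GB of input fits comfortably depends on driver and executor memory, which is worth checking before relying on this.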
Sadly, Apache Spark is not a plug-and-play solution to every problem.
First, you must have a general understanding of how Apache Spark works. Then, you have to use the Spark UI to monitor your job and see why your process is not optimal. The official documentation linked on this page is usually a good start:
https://spark.apache.org/docs/latest/index.html
What's really useful is learning to use the Spark Web UI! Once you understand what every piece of information there means, you know where your application's bottleneck is. This article covers the basic components of the Spark Web UI: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html