
How to handle big reference data in Spark

I have a big data set (say 4 GB) that is used as a reference source to process another big data set (100-200 GB). I have a cluster of 30 executors on 10 nodes. So each executor has its own JVM, right? Every time, it loads the whole reference dataset, which takes a long time and is inefficient. Is there a good approach to handle this? Currently I am storing the data on AWS S3 and running everything with EMR. Maybe it would be better to use a storage layer that I could query on the fly, or to spin up, for example, Redis as part of my cluster, push the data there, and then query it?

UPD1:

  1. The flat data is gzipped CSV files on S3, partitioned into 128 MB chunks.
  2. It is read into a Dataset (coalesce is used to reduce the number of partitions so the data is spread across fewer nodes):

    val df = sparkSession.read
          .format("com.databricks.spark.csv")
          .option("header", "false")
          .schema(schema)
          .option("delimiter", ",")
          .load(path)
          .coalesce(3)
          .as[SegmentConflationRef]

  3. Then I need to convert the flat data into ordered, grouped lists and put them into some key-value storage, an in-memory map in this case:
    // Build the lookup map directly; no mutable map or side-effecting .map needed
    val data: Seq[SegmentConflationRef] = ds.collect()
    val map: Map[String, Seq[SegmentConflationRef]] =
      data.groupBy(_.source_segment_id)
          .map { case (id, refs) => id -> refs.sortBy(_.source_start_offset_m) }
  4. After that, I am going to do lookups against it from another Dataset.
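The grouping in step 3 can be sketched in plain Scala, using a minimal stand-in for `SegmentConflationRef` (the real case class is assumed to have at least a `source_segment_id` and a `source_start_offset_m` field):

```scala
// Minimal stand-in for the real SegmentConflationRef case class (assumed fields).
case class SegmentConflationRef(source_segment_id: String, source_start_offset_m: Double)

// Build an immutable lookup map: segment id -> refs sorted by start offset.
def buildIndex(refs: Seq[SegmentConflationRef]): Map[String, Seq[SegmentConflationRef]] =
  refs.groupBy(_.source_segment_id)
      .map { case (id, group) => id -> group.sortBy(_.source_start_offset_m) }

val index = buildIndex(Seq(
  SegmentConflationRef("a", 2.0),
  SegmentConflationRef("a", 1.0),
  SegmentConflationRef("b", 0.5)
))
// index("a") holds the "a" refs sorted by offset: 1.0 then 2.0
```

An immutable map built in one expression avoids the bug in the original snippet, where `.map` was used purely for its side effect on a mutable map.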

So in this case I want the reference map to be copied to every executor. One problem is how to broadcast such a big map across the nodes, or what a better approach would be. Maybe I shouldn't use Spark at all and should instead load the data locally from HDFS in every executor?
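One standard tool for exactly this situation is Spark's broadcast variables: the driver ships the value once per executor JVM (not once per task), and tasks read the local copy. A minimal sketch, assuming the map from step 3 above; `buildReferenceMap` and `conflate` are hypothetical names for illustration:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical helper: builds the step-3 reference map on the driver.
def buildReferenceMap(): Map[String, Seq[SegmentConflationRef]] = ???

// Ship the map once per executor JVM instead of reloading it per task.
val refBroadcast = spark.sparkContext.broadcast(buildReferenceMap())

// Tasks look up reference data in the local broadcast copy.
def conflate(ds: Dataset[SegmentConflationRef]): Dataset[SegmentConflationRef] =
  ds.map { seg =>
    val candidates = refBroadcast.value.getOrElse(seg.source_segment_id, Seq.empty)
    // ... conflation logic against `candidates` goes here ...
    seg
  }
```

Note that broadcasting ~4 GB is at the edge of what is practical: the driver and every executor must hold the whole map in memory. If it does not fit, the usual fallbacks are an ordinary shuffle join on `source_segment_id` (letting Spark partition both datasets) or an external store such as the Redis idea mentioned above.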

Sadly, Apache Spark is not a plug-and-play solution for every problem.

First, you must have a general understanding of how Apache Spark works. Then you have to use the Spark UI to monitor your application and see why your process is not optimal. The official documentation is usually a good start:

https://spark.apache.org/docs/latest/index.html

What's really useful is learning to use the Spark Web UI! Once you understand what every piece of information there means, you know where your application's bottleneck is. This article covers the basic components of the Spark Web UI: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
