
How to handle big reference data in Spark

I have a big data set (let's say 4 GB) that is used as a reference source to process another big data set (100-200 GB). I have a cluster of 30 executors to do that on 10 nodes. So every executor has its own JVM, right? Every time, it loads the whole reference dataset, which takes a long time and is inefficient. Is there any good approach to handle this? Currently I am storing the data on AWS S3 and running everything with EMR. Would it be better to use more elegant storage that I could query on the fly, or to spin up, for example, Redis as part of my cluster, push the data there, and then query it?

UPD1:

  1. The flat data is gzipped CSV files on S3, partitioned into 128 MB chunks.
  2. It is read into a Dataset (coalesce is used to reduce the number of partitions so the data is spread across fewer nodes):

    val df = sparkSession.sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "false")
          .schema(schema)
          .option("delimiter", ",")
          .load(path)
          .coalesce(3)
          .as[SegmentConflationRef]

  3. Then I need to convert the flat data into ordered, grouped lists and put them into some key-value storage, an in-memory map in this case:
    val data: Seq[SegmentConflationRef] = ds.collect()
    // group by segment id and sort each group by its start offset
    val map: Map[String, Seq[SegmentConflationRef]] =
      data.groupBy(_.source_segment_id)
        .mapValues(_.sortBy(_.source_start_offset_m))
        .toMap
  4. After that I am going to do lookups from another Dataset.

So in this case I want the reference map to be copied to every executor. One problem is how to broadcast such a big map across nodes, or what a better approach would be. Or should I perhaps not use Spark for this at all and instead load the data locally from HDFS in every executor?
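For illustration, here is a minimal sketch of the broadcast variant I have in mind, assuming the grouped map from step 3 fits in driver memory; the Segment case class, the S3 path and its columns are hypothetical stand-ins for the 100-200 GB dataset:

    import org.apache.spark.sql.{Dataset, SparkSession}

    // hypothetical shape of a row in the big (100-200 GB) dataset
    case class Segment(segment_id: String, offset_m: Double)

    val spark: SparkSession = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // broadcast the grouped reference map once; every executor then holds a single
    // read-only copy instead of reloading/rebuilding it per task
    val refBroadcast = spark.sparkContext.broadcast(map)

    // hypothetical read of the big dataset
    val segments: Dataset[Segment] =
      spark.read.parquet("s3://bucket/segments").as[Segment]

    // join each segment with its sorted reference rows via the broadcast copy
    val matched = segments.mapPartitions { iter =>
      val ref = refBroadcast.value // deserialized at most once per executor
      iter.flatMap(s => ref.getOrElse(s.segment_id, Seq.empty).map(r => (s, r)))
    }

Whether a ~4 GB map can actually be broadcast depends on driver and executor memory (and on spark.driver.maxResultSize for the collect); if it cannot, an external key-value store such as Redis, as mentioned above, or a broadcast-hash join on the Datasets themselves would be the usual alternatives.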

Sadly, Apache Spark is not a plug-and-play solution to any problem.

First, you must have a general understanding of how Apache Spark works. Then, you have to use the Spark UI for monitoring and to see why your process is not optimal. The official documentation linked on this page is usually a good start:

https://spark.apache.org/docs/latest/index.html

What's really useful is learning to use the Spark Web UI! Once you understand what every piece of information there means, you know where your application's bottleneck is. This article covers the basic components of the Spark Web UI: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
