
Handling skew data in an Apache Spark production scenario

Can anyone explain how skewed data is handled in production for Apache Spark?

Scenario:

We submitted the Spark job using spark-submit, and in the Spark UI we observed that a few tasks are taking a long time, which indicates the presence of skew.

Questions:

(1) What steps should we take (repartitioning, coalesce, etc.)?

(2) Do we need to kill the job, include the skew fixes in the jar, and re-submit the job?

(3) Can we solve this issue by running commands like coalesce directly from the shell, without killing the job?

Data skew is primarily a problem when applying non-reducing by-key (shuffling) operations. The two most common examples, illustrated with a short sketch after the list, are:

  • Non-reducing groupByKey (RDD.groupByKey, Dataset.groupBy(Key).mapGroups, Dataset.groupBy.agg(collect_list)).
  • RDD and Dataset joins.
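
To illustrate the first case, here is a minimal spark-shell style sketch (it assumes sc is available, as in the example further below, and the key distribution is made up): a non-reducing groupByKey has to materialize every value of the hot key in a single task, while a reducing aggregation combines values map-side first.

// One "hot" key holds almost all of the records - a classic skewed distribution
val pairs = sc.parallelize(1 to 1000000).map(i => (if (i % 100 == 0) s"k$i" else "hot", 1))

// Non-reducing: all records for "hot" are shuffled to, and held by, a single task
pairs.groupByKey().mapValues(_.size).collect()

// Reducing: records are combined map-side, so the hot key adds little to the shuffle
pairs.reduceByKey(_ + _).collect()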

Rarely, the problem is related to the properties of the partitioning key and the partitioning function, with no pre-existing issue in the data distribution.

// All keys are unique - no obvious data skew
val rdd = sc.parallelize(Seq(0, 3, 6, 9, 12)).map((_, None))

// Drastic data skew
rdd.partitionBy(new org.apache.spark.HashPartitioner(3)).glom.map(_.size).collect
// Array[Int] = Array(5, 0, 0)

What steps should we take (repartitioning, coalesce, etc.)?

Repartitioning (never coalesce) can help you with the latter case (sketched below) by:

  • Changing the partitioner.
  • Adjusting the number of partitions to minimize the possible impact of skew (here you can use the same rules as for associative arrays: prime numbers and powers of two should be preferred, although this might not resolve the problem fully, like 3 in the example used above).
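
Continuing the spark-shell example above, a short sketch of both adjustments (the RangePartitioner output is approximate, since its boundaries are determined by sampling):

// With 3 partitions every key above hashes into bucket 0; a partition count that
// does not divide the key pattern spreads the data out
rdd.partitionBy(new org.apache.spark.HashPartitioner(5)).glom.map(_.size).collect
// Array[Int] = Array(1, 1, 1, 1, 1)

// Alternatively, use a partitioner that does not depend on hashCode at all
rdd.partitionBy(new org.apache.spark.RangePartitioner(3, rdd)).glom.map(_.size).collect
// roughly Array(2, 2, 1)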

The former cases typically won't benefit much from repartitioning, because the skew is naturally induced by the operation itself. Values with the same key cannot be spread across multiple partitions, and due to the non-reducing character of the process, the outcome is minimally affected by the initial data distribution.

These cases have to be handled by adjusting the logic of your application. In practice this could mean a number of things, depending on the data or the problem:

  • Removing the operation completely.
  • Replacing the exact result with an approximation.
  • Using different workarounds (typically with joins), for example a frequent-infrequent split, an iterative broadcast join, or prefiltering with a probabilistic filter (like a Bloom filter); the first of these is sketched below.
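
As a rough sketch of the frequent-infrequent split (the data, column names and threshold below are hypothetical, and spark is assumed to be an existing SparkSession, e.g. in spark-shell):

import org.apache.spark.sql.functions.{broadcast, col, count}
import spark.implicits._

// Toy data: one "hot" key dominates the fact side of the join
val facts = (Seq.fill(100000)(("hot", 1)) ++ (1 to 1000).map(i => (s"k$i", i))).toDF("key", "value")
val dims = (("hot", "h") +: (1 to 1000).map(i => (s"k$i", s"d$i"))).toDF("key", "attr")

// 1. Find keys frequent enough to skew the shuffle (the threshold is arbitrary)
val hotKeys = facts.groupBy("key").agg(count("*").as("cnt"))
  .filter(col("cnt") > 10000).select("key")

// 2. Split the fact table into a hot part and a long-tail part
val hotFacts = facts.join(broadcast(hotKeys), Seq("key"))
val coldFacts = facts.join(broadcast(hotKeys), Seq("key"), "left_anti")

// 3. Broadcast-join the hot part (no shuffle on the skewed keys), shuffle-join
//    the long tail, and put the two halves back together
val joined = hotFacts.join(broadcast(dims), Seq("key"))
  .unionByName(coldFacts.join(dims, Seq("key")))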

Do we need to kill the job, include the skew fixes in the jar, and re-submit the job?

Normally you have to at least resubmit the job with adjusted parameters.

In some cases (mostly RDD batch jobs) you can design your application to monitor task execution, and to kill and resubmit a particular job when skew is suspected, but this can be hard to implement correctly in practice.
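
A very rough sketch of such monitoring, assuming a spark-shell style sc; the threshold and the reaction are hypothetical, and a production implementation would need considerably more care:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Flag tasks that run suspiciously long - a crude proxy for skew
val suspiciousMillis = 10 * 60 * 1000L

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.taskInfo.duration > suspiciousMillis) {
      // A real application might emit a metric here, or cancel the job group
      // (sc.cancelJobGroup after sc.setJobGroup) and fall back to a skew-aware path
      println(s"Possible skew: stage ${taskEnd.stageId}, task ${taskEnd.taskInfo.taskId} " +
        s"took ${taskEnd.taskInfo.duration} ms")
    }
  }
})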

In general, if data skew is possible, you should design your application to be immune to it.

Can we solve this issue by running commands like coalesce directly from the shell, without killing the job?

I believe this is already answered by the points above, but just to be explicit: there is no such option in Spark. You can of course include these techniques in your application.
