Spark cores & tasks concurrency

I have a very basic question about Spark. I usually run Spark jobs using 50 cores. While viewing the job progress, most of the time it shows 50 processes running in parallel (as it is supposed to), but sometimes it shows only 2 or 4 Spark processes running in parallel. Like this:

[Stage 8:================================>                      (297 + 2) / 500]

The RDDs being processed are repartitioned into more than 100 partitions, so that shouldn't be an issue.
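
For reference, this is how the partition count can be checked (a minimal sketch; rdd and df stand for whatever RDD or DataFrame is being processed):

print(rdd.getNumPartitions())      # partition count of an RDD
print(df.rdd.getNumPartitions())   # for a DataFrame, via its underlying RDD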

I have an observation, though. I've noticed a pattern: most of the time this happens, the data locality in the Spark UI shows NODE_LOCAL, while at other times, when all 50 processes are running, some of the processes show RACK_LOCAL. This makes me suspect that maybe this happens because the data is cached on the same node before processing to avoid network overhead, and that this slows down the further processing.
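
In case locality really is the culprit, the knob that controls it is spark.locality.wait (default 3s), which is how long the scheduler waits for a NODE_LOCAL slot before falling back to RACK_LOCAL or ANY. A minimal sketch of lowering it, though whether this is the right fix is exactly what I'm unsure about:

from pyspark import SparkConf, SparkContext

# spark.locality.wait is a standard Spark property; lowering it trades
# data locality for parallelism, so tasks launch sooner on non-local executors.
conf = SparkConf().set("spark.locality.wait", "1s")
sc = SparkContext(conf=conf)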

If this is the case, what's the way to avoid it? And if this isn't the case, what's going on here?

After a week or more of struggling with the issue, I think I've found what was causing the problem.

If you are struggling with the same issue, a good place to start would be to check whether the Spark instance is configured properly. There is a great Cloudera blog post about it.
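
To give "configured properly" a concrete shape: the usual suspects are the number of executors, cores per executor, and executor memory. A minimal sketch of setting them explicitly (the values are placeholders, not recommendations; the keys mirror the --num-executors / --executor-cores / --executor-memory flags of spark-submit):

from pyspark.sql import SparkSession

# Placeholder resource sizing; tune these to your cluster following the
# Cloudera post referenced above.
spark = (SparkSession.builder
         .appName("my-app")                           # hypothetical app name
         .config("spark.executor.instances", "10")    # placeholder
         .config("spark.executor.cores", "5")         # placeholder
         .config("spark.executor.memory", "8g")       # placeholder
         .getOrCreate())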

However, if the problem isn't with the configuration (as was the case for me), then the problem is somewhere within your code. The issue is that sometimes, due to various reasons (skewed joins, uneven partitions in the data sources, etc.), the RDD you are working on gets a lot of data on 2-3 partitions while the rest of the partitions have very little data.

In order to reduce data shuffling across the network, Spark tries to have each executor process the data residing locally on that node. So 2-3 executors end up working for a long time, while the rest of the executors are done with their data within a few milliseconds. That's why I was experiencing the issue I described in the question above.

The way to debug this problem is, first of all, to check the partition sizes of your RDD. If one or a few partitions are very big in comparison to the others, the next step is to look at the records in the large partitions, so that you can tell, especially in the case of skewed joins, which key is getting skewed. I've written a small function to debug this:

from itertools import islice

def check_skewness(df):
    # Take a 1% sample so the per-partition counts are cheap to compute.
    sampled_rdd = df.sample(False, 0.01).rdd.cache()
    # Count the sampled rows in each partition: [(partition_index, row_count), ...]
    counts = sampled_rdd.mapPartitionsWithIndex(
        lambda idx, it: [(idx, sum(1 for _ in it))]).collect()
    max_part = max(counts, key=lambda item: item[1])
    min_part = min(counts, key=lambda item: item[1])
    # Flag skew when the largest partition holds over 5x the rows of the smallest.
    if min_part[1] == 0 or max_part[1] / min_part[1] > 5:
        print('Partitions Skewed: Largest Partition', max_part,
              'Smallest Partition', min_part,
              '\nSample Content of the largest Partition:')
        print(sampled_rdd.mapPartitionsWithIndex(
            lambda i, it: islice(it, 5) if i == max_part[0] else []).take(5))
    else:
        print('No Skewness: Largest Partition', max_part,
              'Smallest Partition', min_part)

It gives me the smallest and largest partition sizes, and if the difference between the two is more than 5x, it prints 5 elements of the largest partition, which should give you a rough idea of what's going on.
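
If the sampled rows point towards one particular join key, a quick per-key count can confirm it. A minimal sketch, assuming a DataFrame df with a join column called key (both are placeholder names):

from pyspark.sql import functions as F

# Count rows per join key and list the heaviest ones; a handful of keys
# carrying most of the rows is the classic signature of a skewed join.
df.groupBy("key").count().orderBy(F.desc("count")).show(10, truncate=False)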

Once you have figured out that the problem is a skewed partition, you can find a way to get rid of the skewed key, or you can repartition your DataFrame, which forces the data to be distributed evenly. You'll then see that all the executors work for roughly the same amount of time, you'll hit far fewer of the dreaded OOM errors, and processing will be significantly faster too.
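
Two hedged sketches of those fixes, assuming a DataFrame df with a skewed join column key (names and numbers are placeholders, not tuned values):

from pyspark.sql import functions as F

# Option 1: plain repartition; without a column Spark redistributes rows
# evenly across the new partitions. 200 is a placeholder count.
df_even = df.repartition(200)

# Option 2: salt the hot key so its rows spread over several partitions;
# the other side of the join must be expanded with the same salt range
# (0..num_salts-1) for the join to still match.
num_salts = 10  # placeholder
df_salted = df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"), F.floor(F.rand() * num_salts).cast("string")))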

These are just my two cents as a Spark novice. I hope Spark experts can add some more to this issue, as I think a lot of newbies in the Spark world face this kind of problem far too often.
