Spark example program runs very slowly

Question

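I am trying to use Spark to work on simple graph problems. I found an example program in the Spark source folder, transitive_closure.py, which computes the transitive closure of a graph with no more than 200 edges and vertices. But on my own laptop it runs for more than 10 minutes and does not terminate. The command line I use is: spark-submit transitive_closure.py.

Why is Spark so slow even when computing such a small transitive closure? Is this a common case? Am I missing any configuration?

The program is shown below and can be found in the Spark installation folder from their website.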

from __future__ import print_function

import sys
from random import Random

from pyspark import SparkContext

numEdges = 200
numVertices = 100
rand = Random(42)


def generateGraph():
    edges = set()
    while len(edges) < numEdges:
        src = rand.randrange(0, numEdges)
        dst = rand.randrange(0, numEdges)
        if src != dst:
            edges.add((src, dst))
    return edges


if __name__ == "__main__":
    """
    Usage: transitive_closure [partitions]
    """
    sc = SparkContext(appName="PythonTransitiveClosure")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    tc = sc.parallelize(generateGraph(), partitions).cache()

    # Linear transitive closure: each round grows paths by one edge,
    # by joining the graph's edges with the already-discovered paths.
    # e.g. join the path (y, z) from the TC with the edge (x, y) from
    # the graph to obtain the path (x, z).

    # Because join() joins on keys, the edges are stored in reversed order.
    edges = tc.map(lambda x_y: (x_y[1], x_y[0]))

    oldCount = 0
    nextCount = tc.count()
    while True:
        oldCount = nextCount
        # Perform the join, obtaining an RDD of (y, (z, x)) pairs,
        # then project the result to obtain the new (x, z) paths.
        new_edges = tc.join(edges).map(lambda __a_b: (__a_b[1][1], __a_b[1][0]))
        tc = tc.union(new_edges).distinct().cache()
        nextCount = tc.count()
        if nextCount == oldCount:
            break

    print("TC has %i edges" % tc.count())

    sc.stop()

Answer 1

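There can be many reasons why this code performs poorly on your machine, but most likely this is just another variant of the problem described in "Spark iteration time increasing exponentially when using join". The simplest way to check whether that is indeed the case is to provide the spark.default.parallelism parameter on submit: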

bin/spark-submit --conf spark.default.parallelism=2 \
  examples/src/main/python/transitive_closure.py

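Unless limited otherwise, SparkContext.union, RDD.join and RDD.union set the number of partitions of the child RDD to the total number of partitions in the parents. Usually this is the desired behavior, but it can become extremely inefficient when applied iteratively.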
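To make the growth concrete, here is a minimal pure-Python sketch that only counts partitions; it does not run Spark. It assumes the rule stated in this answer (the child of join and union gets the sum of the parents' partition counts) and further assumes that distinct() keeps the count unchanged when spark.default.parallelism is unset. The function name partition_growth is made up for illustration.

```python
def partition_growth(initial_partitions, iterations):
    """Model the partition count of `tc` across loop iterations.

    Assumptions (not measured on a real cluster):
      * join(child)  = partitions(tc) + partitions(edges)
      * union(child) = partitions(tc) + partitions(new_edges)
      * distinct() leaves the partition count unchanged
    """
    tc_parts = initial_partitions    # tc = sc.parallelize(graph, partitions)
    edge_parts = initial_partitions  # edges = tc.map(...) keeps the same count
    history = [tc_parts]
    for _ in range(iterations):
        join_parts = tc_parts + edge_parts  # new_edges = tc.join(edges)
        tc_parts = tc_parts + join_parts    # tc = tc.union(new_edges).distinct()
        history.append(tc_parts)
    return history


if __name__ == "__main__":
    # Starting from the example's default of 2 partitions.
    print(partition_growth(2, 6))  # [2, 6, 14, 30, 62, 126, 254]
```

Under these assumptions the count roughly doubles every round, so a handful of iterations on a tiny graph already produces hundreds of mostly empty partitions, each with its own scheduling overhead. Capping the count, for example through spark.default.parallelism as above or through the optional numPartitions argument that RDD.join and RDD.distinct accept, would keep it constant instead.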

Answer 2

The usage says the command line is

transitive_closure [partitions]

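Setting the default parallelism will only help with the joins in each partition, not with the initial distribution of work.

I would argue that more partitions should be used. Setting the default parallelism may still help, but the code you posted sets the number explicitly (the argument passed, or 2 if none is given). The absolute minimum should be the number of cores available to Spark; otherwise you are always working at less than 100% utilization.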
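As a rough illustration of that advice, the sketch below parses the argument the same way the example does and then raises the result to at least the core count. The helper name choose_partitions is hypothetical, and using os.cpu_count() as a stand-in for "cores available to Spark" is an assumption; a real deployment may give Spark fewer cores than the machine has.

```python
import os
import sys


def choose_partitions(argv, available_cores=None):
    """Return the partition count to pass to sc.parallelize().

    `available_cores` is a hypothetical override; when omitted, the
    machine's core count is used as an approximation of what Spark gets.
    """
    cores = available_cores or os.cpu_count() or 1
    # Same parsing as the example: first argument, defaulting to 2.
    requested = int(argv[1]) if len(argv) > 1 else 2
    # Never go below the number of cores, so no core sits idle.
    return max(requested, cores)


if __name__ == "__main__":
    print("partitions:", choose_partitions(sys.argv))
```

In the posted program, the result would replace the value assigned to partitions before the call to sc.parallelize(generateGraph(), partitions).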

Answer 1: 5, accepted, 2016-02-23 01:06:29

Answer 2: 0, 2016-02-23 05:04:42
