Spark not running an RDD in parallel in PySpark with binaryFiles

Question

I am new to Spark and have started writing some scripts in Python. My understanding is that Spark executes transformations (such as map) in parallel.

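import time
from datetime import datetime

from pyspark import SparkConf, SparkContext

def some_function(name, content):
    # Log when each file starts, then simulate 30 seconds of work.
    print(name, datetime.now())
    time.sleep(30)
    return content

config = SparkConf().setAppName("sample2").setMaster("local[*]")
# binaryFiles yields (path, content) pairs, one per matching file.
filesRDD = SparkContext(conf=config).binaryFiles("F:\\usr\\temp\\*.zip")
inputfileRDD = filesRDD.map(lambda job_bundle: (job_bundle[0], some_function(job_bundle[0], job_bundle[1])))
print(inputfileRDD.collect())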

The code above collects the list of .zip files in a folder and processes them. When I run it, I see that the files are processed sequentially.

Output

file:/F:/usr/temp/sample.zip 2020-10-22 10:42:37.089085
file:/F:/usr/temp/sample1.zip 2020-10-22 10:43:07.103317

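You can see that it starts processing the second file 30 seconds later, that is, only after the first one has finished. What is wrong with my code? Why is the RDD not processed in parallel? Can you help me?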
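To make the "30 seconds later" observation concrete, here is a minimal sketch that computes the gap between consecutive start timestamps from the output above. The helper name `start_gaps` is hypothetical; it only assumes that each line ends with a date and a time in the format printed by `datetime.now()`.

```python
from datetime import datetime

def start_gaps(log_lines):
    """Return the seconds elapsed between consecutive start timestamps."""
    stamps = []
    for line in log_lines:
        # Each line is "<path> <date> <time>"; the timestamp is the last two fields.
        _, date_part, time_part = line.rsplit(" ", 2)
        stamps.append(datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S.%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]

log = [
    "file:/F:/usr/temp/sample.zip 2020-10-22 10:42:37.089085",
    "file:/F:/usr/temp/sample1.zip 2020-10-22 10:43:07.103317",
]
print(start_gaps(log))
```

A gap close to the sleep duration suggests that the second task waited for the first one. Tasks running in parallel would be expected to start within a fraction of a second of each other.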

Answer 1

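I do not know exactly how the binaryFiles method distributes files across Spark partitions. It seems that, unlike textFile, it tends to create only one partition. Let's look at this with an example directory called dir that contains 5 files.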

> ls dir
test1  test2  test3  test4  test5

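If I use textFile, things run in parallel. I am not including the output because it is not very pretty, but you can check it yourself. We can confirm that the data is spread over several partitions with getNumPartitions.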

>>> sc.textFile("dir").foreach(lambda x: some_function(x, None))
# ugly output, but everything starts at the same time,
# except maybe the last one since you have 4 cores.
>>> sc.textFile("dir").getNumPartitions()
5

With binaryFiles things are different: for some reason, everything goes into the same partition.

>>> sc.binaryFiles("dir").getNumPartitions()
1
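getNumPartitions only reports the number of partitions. As a minimal sketch, the hypothetical helper `partition_sizes` below shows one way to also see how many records each partition holds. It relies only on the standard RDD methods `glom`, `map` and `collect`; the commented usage assumes an existing SparkContext named `sc` and the `dir` directory from above.

```python
def partition_sizes(rdd):
    """Return the number of records held by each partition of an RDD."""
    # glom() turns every partition into one list, so len() counts its records.
    return rdd.glom().map(len).collect()

# Usage, assuming a running SparkContext `sc`:
#   partition_sizes(sc.textFile("dir"))
#   partition_sizes(sc.binaryFiles("dir"))
```

If binaryFiles really places everything in one partition, the second call should return a single number equal to the file count.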

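I even tried with 10k files and everything still ended up in the same partition. I believe the reason is that in Scala, binaryFiles returns an RDD of file names, each paired with an object that allows reading the file (but does not perform the read). It is therefore fast, and the resulting RDD is small, so keeping it in one partition is fine. In Scala, we can therefore call repartition after binaryFiles and things work well.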

scala> sc.binaryFiles("dir").getNumPartitions
1
scala> sc.binaryFiles("dir").repartition(4).getNumPartitions
4
scala> sc.binaryFiles("dir").repartition(4)
    .foreach{ case (name, ds) => { 
        println(System.currentTimeMillis+": "+name)
        Thread.sleep(2000)
        // do some reading on the DataStream ds
    }}

1603352918396: file:/home/oanicol/sandbox/dir/test1
1603352918396: file:/home/oanicol/sandbox/dir/test3
1603352918396: file:/home/oanicol/sandbox/dir/test4
1603352918396: file:/home/oanicol/sandbox/dir/test5
1603352920397: file:/home/oanicol/sandbox/dir/test2
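The idea of a record that carries a file name plus a deferred reader can be illustrated in plain Python. This is only a sketch of the concept: `LazyFile` is a hypothetical class, not part of Spark, and it merely shows that such a record stays cheap until `read()` is called.

```python
import os
import tempfile

class LazyFile:
    """Hypothetical stand-in for a deferred reader: holds a path, reads only on demand."""

    def __init__(self, path):
        self.path = path
        self.reads = 0  # how many times the content was actually loaded

    def read(self):
        self.reads += 1
        with open(self.path, "rb") as handle:
            return handle.read()

# Create a small file to point at.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"x" * 1024)

record = (tmp.name, LazyFile(tmp.name))  # cheap to create and to move around
untouched = record[1].reads              # nothing has been read yet
payload = record[1].read()               # the 1024 bytes are loaded only here
os.remove(tmp.name)
```

Because the record holds only a path and a small object, moving it between partitions costs little, which is consistent with repartition working well in the Scala example.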

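The problem in Python is that binaryFiles actually reads the files, and does so into a single partition. In addition, and this is quite mysterious to me, the following lines of code in PySpark 2.4 produce the same behavior you noticed, which does not make sense.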

# this should work but does not
sc.binaryFiles("dir", minPartitions=4).foreach(lambda x: some_function(x, ''))
# this does not work either, which is strange but it would not be advised anyway
# since all the data would be read on one partition
sc.binaryFiles("dir").repartition(4).foreach(lambda x: some_function(x, ''))

However, since binaryFiles actually reads the files, you can use wholeTextFiles instead, which reads each file as a text file and behaves as expected:

# this works
sc.wholeTextFiles("dir", minPartitions=4).foreach(lambda x: some_function(x, ''))
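Note that wholeTextFiles decodes each file as text (UTF-8 by default), so it may not preserve binary data such as .zip archives. One possible alternative, which is not part of the approach above, is to distribute the file paths and read the bytes inside each task. The sketch below assumes that every executor can open the same paths (true with local[*]); `list_files`, `read_file` and `read_files_in_parallel` are hypothetical helper names, and `sc` is an existing SparkContext.

```python
import glob
import os

def list_files(pattern):
    """Return the sorted local paths of the files matching a glob pattern."""
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))

def read_file(path):
    """Read one file inside a task and return (path, bytes)."""
    with open(path, "rb") as handle:
        return path, handle.read()

def read_files_in_parallel(sc, pattern):
    """Build an RDD of (path, bytes) with one partition per file."""
    paths = list_files(pattern)
    # One slice per path, so every file can be handled by its own task.
    return sc.parallelize(paths, numSlices=max(len(paths), 1)).map(read_file)

# Usage, assuming a running SparkContext `sc`:
#   read_files_in_parallel(sc, "dir/*").foreach(lambda kv: some_function(kv[0], kv[1]))
```

With one partition per file, the tasks should be able to start concurrently, up to the number of available cores.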

Answer 1: score 2, accepted, 2020-10-22 08:01:07
