
Why does RDD.groupBy return an empty RDD if the initial RDD wasn't empty?

I have an RDD that I've used to load binary files. Each file is broken into multiple parts and processed. After the processing step, each entry is:

(filename, List[Results])

Since the files are broken into several parts, the filename is the same for several entries in the RDD. I'm trying to put the results for each part back together using reduceByKey. However, when I attempt to run a count on this RDD it returns 0:

val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB)
reducedResults.count() // 0

I've tried changing the key it uses with no success. Even with extremely simple attempts to group the results I don't get any output.

val singleGroup = my_rdd.groupBy { case (k, v) => 1 }
singleGroup.count() // 0

On the other hand, if I simply collect the results, then I can group them outside of Spark and everything works fine. However, I still have additional processing that I need to do on the collected results, so that isn't a good option.
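For reference, the local workaround looks roughly like this (Results stands in for my actual result type):

val local = my_rdd.collect()                           // Array[(String, List[Results])], pulled to the driver
val grouped = local.groupBy { case (filename, _) => filename }
grouped.size                                           // non-zero, as expected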

What could cause the groupBy/reduceByKey operations to return empty RDDs if the initial RDD isn't empty?

Turns out there was a bug in how I was generating the Spark configuration for that particular job. Instead of setting the spark.default.parallelism field to something reasonable, it was being set to 0.

From the Spark documentation on spark.default.parallelism:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
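One quick way to see this (assuming Spark 1.6 or later, where RDD.getNumPartitions is available): the shuffled RDD comes back with zero partitions, so there is nothing to count.

println(my_rdd.getNumPartitions)          // e.g. 4: the input RDD still has its partitions
println(reducedResults.getNumPartitions)  // 0: the shuffle inherited spark.default.parallelism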

So while an operation like collect() worked perfectly fine, any attempt to reshuffle the data without specifying the number of partitions gave me an empty RDD. That'll teach me to trust old configuration code.
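For completeness, a minimal sketch of the fix (the app name and the value 8 are placeholders; the point is that spark.default.parallelism must be a positive number, or the shuffle must be given an explicit partition count):

import org.apache.spark.{SparkConf, SparkContext}

// The bug: spark.default.parallelism was "0", so every shuffle produced an RDD with 0 partitions.
// The fix: set it to a positive value when building the configuration.
val conf = new SparkConf()
  .setAppName("binary-file-job")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)

// Alternatively, bypass the default entirely by passing numPartitions to the shuffle:
val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB, 8)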

