Removing empty DataFrame partitions in Apache Spark

Question

I am trying to repartition a DataFrame by a column x that has N distinct values (say N = 3), for example:

val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x") // create dummy data

What I want is to repartition myDF by x without producing empty partitions. Is there a better way than the following?

val numParts = myDF.select($"x").distinct().count.toInt
myDF.repartition(numParts,$"x")

(If I do not specify numParts in repartition, most of my partitions are empty, because repartition then creates 200 partitions...)
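The 200 comes from the `spark.sql.shuffle.partitions` setting, whose default value is 200. The following is a minimal sketch, assuming a local SparkSession (the `local[2]` master, the app name, and the `ShufflePartitionsSketch` object are illustrative, not from the question), that compares the partition count with and without an explicit numParts. On Spark versions with adaptive query execution enabled, the count without numParts may come out lower than the setting.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  // Returns (partition count without numParts, partition count with numParts).
  def partitionCounts(spark: SparkSession): (Int, Int) = {
    import spark.implicits._
    val myDF = Seq(1, 1, 2, 2, 3, 3).toDF("x")

    // No explicit count: Spark falls back to spark.sql.shuffle.partitions
    // (adaptive query execution may reduce this further).
    val defaultCount = myDF.repartition($"x").rdd.getNumPartitions

    // Explicit count, as in the question.
    val numParts = myDF.select($"x").distinct().count().toInt
    val explicitCount = myDF.repartition(numParts, $"x").rdd.getNumPartitions

    (defaultCount, explicitCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-partitions-sketch")
      .getOrCreate()

    val configured = spark.conf.get("spark.sql.shuffle.partitions")
    val (defaultCount, explicitCount) = partitionCounts(spark)
    println(s"spark.sql.shuffle.partitions => $configured")
    println(s"without numParts => $defaultCount, with numParts => $explicitCount")

    spark.stop()
  }
}
```

Keep in mind that repartitioning by a column is hash-based, so two distinct values of x can land in the same partition. Even with numParts equal to the number of distinct values, an empty partition is still possible.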

Answer 1

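I would consider a solution that iterates over the partitions of df and counts the records in each one to find the non-empty partitions.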

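// Accumulator that ends up holding the number of non-empty partitions
val nonEmptyPart = sparkContext.longAccumulator("nonEmptyPart")

// hasNext checks for at least one row without walking the whole partition;
// going through .rdd gives a plain Iterator[Row] => Unit signature
df.rdd.foreachPartition(partition =>
  if (partition.hasNext) nonEmptyPart.add(1))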

Once we have the number of non-empty partitions (nonEmptyPart), we can use coalesce() to get rid of the empty ones (compare coalesce() with repartition()).

val finalDf = df.coalesce(nonEmptyPart.value.toInt) //coalesce() accepts only Int type

It may or may not be the best approach, but this solution avoids a shuffle because it does not use repartition().
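The same idea can be sketched without an accumulator by letting every partition report 1 or 0 and summing the results on the driver. This is only an illustration under stated assumptions: the `CoalesceSketch` object and its `countNonEmptyPartitions` and `coalesceToNonEmpty` helpers are hypothetical names, not part of the Spark API.

```scala
import org.apache.spark.sql.DataFrame

object CoalesceSketch {
  // Each partition emits 1 if it holds at least one row, otherwise 0;
  // the driver sums the flags.
  def countNonEmptyPartitions(df: DataFrame): Int =
    df.rdd
      .mapPartitions(rows => Iterator(if (rows.hasNext) 1 else 0))
      .collect()
      .sum

  // coalesce() needs a positive partition count, so fall back to 1
  // when the DataFrame has no rows at all.
  def coalesceToNonEmpty(df: DataFrame): DataFrame =
    df.coalesce(math.max(1, countNonEmptyPartitions(df)))
}
```

A call such as `CoalesceSketch.coalesceToNonEmpty(df)` still runs one extra job to do the counting, just like the accumulator version, so the DataFrame is evaluated twice unless it is cached.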

Example addressing the comments

val df1 = sc.parallelize(Seq(1, 1, 2, 2, 3, 3)).toDF("x").repartition($"x")
val nonEmptyPart = sc.longAccumulator("nonEmptyPart")

// hasNext checks for at least one row without walking the whole partition
df1.rdd.foreachPartition(partition =>
  if (partition.hasNext) nonEmptyPart.add(1))

val finalDf = df1.coalesce(nonEmptyPart.value.toInt)

println(s"nonEmptyPart => ${nonEmptyPart.value.toInt}")
println(s"df.rdd.partitions.length => ${df1.rdd.partitions.length}")
println(s"finalDf.rdd.partitions.length => ${finalDf.rdd.partitions.length}")

Output

nonEmptyPart => 3
df.rdd.partitions.length => 200
finalDf.rdd.partitions.length => 3
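The partition count alone does not show how the rows are distributed. coalesce() merges existing partitions without a shuffle and does not look at which of them hold data, so it can be worth checking the row count of each resulting partition. A minimal sketch, assuming a hypothetical `PartitionInspection.rowsPerPartition` helper:

```scala
import org.apache.spark.sql.DataFrame

object PartitionInspection {
  // Returns one (partitionIndex, rowCount) pair per partition,
  // including partitions that hold no rows.
  def rowsPerPartition(df: DataFrame): Array[(Int, Int)] =
    df.rdd
      .mapPartitionsWithIndex((index, rows) => Iterator((index, rows.size)))
      .collect()
}
```

Calling `PartitionInspection.rowsPerPartition(finalDf).foreach(println)` prints one pair per partition; a pair with a row count of 0 means that partition is still empty.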

Solution 1
9 2017-02-05 06:04:28
