
Spark shuffling data before insert

CalcDf().show results in 13 stages (0-12), plus one more (13) for the show itself.

When I try to write the result to a table, I would expect only 13 stages (0-12), but instead I see an additional stage (13). Where does it come from and what does it do? I'm not performing any repartition or other operation that would require a shuffle. As far as I understand, Spark should just write 1100 files into the table, but that's not what's happening.

CalcDf()
.write
.mode(SaveMode.Overwrite)
.insertInto("tn")

CalcDf() logic:

val dim = spark.sparkContext.broadcast(
  spark.table("dim")
    .as[Dim]
    .map(r => r.id -> r.col)
    .collect().toMap
)

spark.table("table")
.as[CustomCC]
.groupByKey(_.id)
.flatMapGroups{case(k, iterator) => CustomCC.mapRows(iterator, dim)}
.withColumn("time_key", lit("2021-07-01"))
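The shuffle that groupByKey introduces can be confirmed in the physical plan, where it appears as an Exchange node. A minimal sketch, assuming a running SparkSession named `spark` and the `CustomCC`/`dim` definitions above:

```scala
import org.apache.spark.sql.functions.lit

// Sketch: inspect the physical plan of CalcDf()'s result to locate the shuffle.
val result = spark.table("table")
  .as[CustomCC]
  .groupByKey(_.id)
  .flatMapGroups { case (_, it) => CustomCC.mapRows(it, dim) }
  .withColumn("time_key", lit("2021-07-01"))

// The printed plan contains an "Exchange hashpartitioning(id, ...)" node:
// that exchange is stage #12's shuffle write and stage #13's shuffle read.
result.explain()
```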

The previous stage #12 has done a shuffle write, so any subsequent stage has to read that data via a shuffle read (which is what you see in #13).

Why is there an additional stage?

Because stage 12 ends with a shuffle write, not an output.

To understand stage #12, please share how CalcDf is built.

EDIT

groupByKey performs a shuffle write so that rows with the same id end up on a single executor JVM.

Stage 13 reads this shuffled data and then computes the map operation.

The difference in task count can be attributed to the action. show() doesn't read the whole shuffled data, probably because it only displays 20 rows (the default), whereas insertInto(...) operates on the whole dataset and therefore reads all of it.

Stage #13 is not there just because files are being written; it is actually doing computation.
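Since the task count of the post-shuffle stage follows `spark.sql.shuffle.partitions` (200 by default), the number of files the insert produces can be influenced by setting that property before writing. A sketch, using the 1100 files mentioned in the question as an illustrative target:

```scala
// spark.sql.shuffle.partitions sets the task count of the post-shuffle
// stage (#13 here), and hence the maximum number of files insertInto writes.
spark.conf.set("spark.sql.shuffle.partitions", "1100")

CalcDf()
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("tn")
```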
