
Spark dataframe.map() processed each row more than once

Running this code:

(1)

val resultDf = myDataFrame.map(row => { println(s"$row"); row })

I see exactly one printout per row (using "yarn logs -applicationId xxxx" to retrieve the logs). However, when the processing code is more complex:

(2)

val resultDf = myDataFrame.map(row => { println(s"$row"); /* complex processing code */})

I see about 2 or 3 times more printouts than the actual row count. But in both cases myDataFrame.count == resultDf.count.

Question: in case (2), do I see more printouts because Spark runs dataFrame.map() in more containers for redundancy, and throws away the extra results when the redundant executions all succeed? Thanks.

BTW, I run Spark jobs on AWS EMR, Spark 3.1.2.

The result DataFrame is passed downstream, and at one point it is not cached but is referenced 3 times, so myDataFrame.map() is re-executed for each reference.
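The recomputation can be reproduced without Spark at all: a Scala lazy view behaves the same way, re-running its map function on every traversal, just as an uncached DataFrame recomputes its lineage for each downstream action. This is a minimal sketch; the object name LazyRecomputeDemo and the small input range are illustrative, not from the original question.

```scala
// Plain Scala sketch of lazy recomputation, analogous to an uncached
// DataFrame whose lineage is re-evaluated by each downstream reference.
object LazyRecomputeDemo {
  var mapCalls = 0

  // A view is lazy: the map function does not run here.
  val mapped = (1 to 3).view.map { x =>
    mapCalls += 1
    x * 2
  }

  def main(args: Array[String]): Unit = {
    mapped.sum // first traversal: runs the map function 3 times
    mapped.sum // second traversal: runs it 3 more times
    println(s"map function ran $mapCalls times") // 6, not 3
  }
}
```

In Spark the analogous remedy is to call cache() or persist() on the mapped DataFrame before it is referenced multiple times, so the map runs once and later references read the materialized result.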
