
Spark dataframe.map() processed each row more than once

Running this code:

(1)

val resultDf = myDataFrame.map(row => { println(s"$row"); row })

I see exactly one printout per row (using "yarn logs -applicationId xxxx" to retrieve the logs). However, when the processing code is more complex:

(2)

val resultDf = myDataFrame.map(row => { println(s"$row"); /* complex processing code */})

I see about 2 or 3 times more printouts than the actual row count. But in both cases myDataFrame.count == resultDf.count.

Question: in case (2), do I see more printouts because Spark runs dataFrame.map() in more containers for redundancy, and throws away the extra results when the redundant executions all succeed? Thanks.

BTW, I run Spark jobs on AWS EMR, Spark 3.1.2.

The result DataFrame is passed downstream, and at one point it is not cached but is referenced 3 times, so myDataFrame.map() is re-executed for each reference.
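The recomputation can be reproduced without Spark at all: a Scala lazy view behaves the same way, re-running its map function on every traversal, just as an uncached DataFrame recomputes its lineage for each downstream action. This is a minimal sketch; the object name LazyRecomputeDemo and the small input range are illustrative, not from the original question.

```scala
// Plain Scala sketch of lazy recomputation, analogous to an uncached
// DataFrame whose lineage is re-evaluated by each downstream reference.
object LazyRecomputeDemo {
  var mapCalls = 0

  // A view is lazy: the map function does not run here.
  val mapped = (1 to 3).view.map { x =>
    mapCalls += 1
    x * 2
  }

  def main(args: Array[String]): Unit = {
    mapped.sum // first traversal: runs the map function 3 times
    mapped.sum // second traversal: runs it 3 more times
    println(s"map function ran $mapCalls times") // 6, not 3
  }
}
```

In Spark the analogous remedy is to call cache() or persist() on the mapped DataFrame before it is referenced multiple times, so the map runs once and later references read the materialized result.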
