
How to force Spark to evaluate DataFrame operations inline

According to the Spark RDD docs:

All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently.

There are times when I need to do certain operations on my dataframes right then and there. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there's very little guarantee that Spark will actually execute those operations inline with the rest of the code. For example:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)

// Now do some stuff with 'unionDataFrame'...

So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count()  // Forces the union to execute/compute

// Now do some stuff with 'unionDataFrame'...

...which forces Spark to execute the dataframe op right then and there, inline.

This feels awfully hacky/kludgy to me. So I ask: is there a more generally-accepted and/or efficient way to force dataframe ops to happen on-demand (and not be lazily evaluated)?

No.

You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love Spark.


By the way, I am pretty sure Spark knows very well when something must be done "right here and now", so probably you are focusing on the wrong point.


Can you just confirm that count() and show() are considered "actions"?

You can see some of the action functions of Spark in the documentation, where count() is listed. show() is not, and I haven't used it before, but it feels like it is an action; how can you show the result without doing actual work? :)
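
For what it's worth, here is a minimal sketch (assuming a Spark 2.x+ SparkSession; the sample data and column math are purely illustrative) showing that both count() and show() trigger real work, while the transformation itself does not:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("n")

val doubled = df.select(($"n" * 2).as("twice")) // transformation: returns immediately, no job runs
doubled.count()                                 // action: launches a Spark job to count the rows
doubled.show()                                  // also an action: rows must be computed before they can be printed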

Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?

Yes! :)

Spark remembers the transformations you have called, and when an action appears, it will do them, just at the right time!


Something to remember: because of this policy of doing actual work only when an action appears, you will not see a logical error in your transformation(s) until the action takes place!
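
A minimal sketch of that pitfall (assuming Spark 2.x+; the UDF and sample data are illustrative): the buggy transformation below is accepted without complaint, and the parse error only surfaces when show() forces computation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("1", "2", "oops").toDF("value")

// Latent bug: "oops".toInt will throw, but defining the transformation
// is instantaneous and raises no error...
val parseInt = udf((s: String) => s.toInt)
val parsed = df.withColumn("asInt", parseInt($"value"))

// ...only this action forces the computation, and the NumberFormatException
// finally surfaces (wrapped in a SparkException).
parsed.show()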

I agree with you that at some point you want to run the action exactly when YOU need it. For example, if you are streaming data with Spark Streaming and you want to evaluate the transformations done on every RDD, rather than accumulating transformations across RDDs and then suddenly running one action on that large set of data.
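
A hedged sketch of that per-batch pattern with the DStream API (the socket source, host, and port are illustrative assumptions, not part of the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("per-batch-eval")
val ssc  = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  // Running an action inside foreachRDD evaluates this micro-batch's
  // transformations right away, instead of accumulating work:
  val n = rdd.map(_.length).count()
  println(s"processed $n lines in this batch")
}

ssc.start()
ssc.awaitTermination()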

Now, let's say you have a DataFrame and you have done all your transformations on it; you can then run spark.sql("CACHE TABLE <table-name>") on a SparkSession, after registering the DataFrame under that table name.

This cache is an eager cache: it triggers an action on the DataFrame and evaluates all of the transformations on it immediately.
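
A minimal sketch of this eager-cache approach (assuming Spark 2.x+, where union replaces the deprecated unionAll from the question, and where the temp-view name is a hypothetical placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val someDataFrame      = Seq(1, 2).toDF("n")
val someOtherDataFrame = Seq(3, 4).toDF("n")

val unionDataFrame = someDataFrame.union(someOtherDataFrame)

// Register the DataFrame under a table name so SQL can see it...
unionDataFrame.createOrReplaceTempView("union_table")

// ...then cache it eagerly: CACHE TABLE materializes the data immediately,
// forcing all pending transformations to run right here.
spark.sql("CACHE TABLE union_table")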
