
Spark cache/persist vs shuffle files


I am trying to understand the benefit of Spark's caching/persist mechanism.

AFAIK, Spark always persists the RDDs after a shuffle as an optimization. So what benefit does the cache/persist call provide? I am assuming the cache/persist happens in memory, so the only benefit is that it won't read from the disk, correct?

Two differences come to mind:

  1. During a shuffle, intermediate data (data that needs to be shuffled across nodes) gets saved so as to avoid reshuffling. This is reflected in the Spark UI as skipped stages. With cache/persist, you are caching the processed data.
  2. You are in control of what needs to be cached, but you don't have explicit control over caching shuffled data (it is a behind-the-scenes optimization); see the sketch after this list.
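
A minimal Scala sketch of the difference, assuming an existing SparkSession named spark (the /data/events path and userId column are invented for illustration):

import org.apache.spark.storage.StorageLevel

// Explicit caching: you decide what to keep and at which storage level.
val df = spark.read.parquet("/data/events")
val aggregated = df.groupBy("userId").count()
aggregated.persist(StorageLevel.MEMORY_AND_DISK)   // or simply aggregated.cache()
aggregated.count()   // first action computes the result and fills the cache
aggregated.show()    // served from the cache, no recomputation

// Shuffle-file reuse, by contrast, is implicit: a later job that shares the
// same shuffle shows up as "skipped stages" in the Spark UI, but you don't
// choose what is kept or when it is cleaned up.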

what benefit does cache/persist call provide?

One example that comes to mind at once is the Spark reading process: if you read some data from the file system and run two separate sets of transformations (each finishing with an action) on it, you will load the source data twice (you can check the Spark UI and you will see two loads), but if you cache it, the load will only happen once.
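
As a rough illustration of that double load, assuming a SparkSession named spark and a hypothetical /data/logs.csv file with a level column:

import spark.implicits._

val logs = spark.read.option("header", "true").csv("/data/logs.csv")

// Without caching, each action triggers its own scan of the source file.
logs.filter($"level" === "ERROR").count()   // first load of /data/logs.csv
logs.filter($"level" === "WARN").count()    // second load of /data/logs.csv

// With caching, the file is read once and then reused.
logs.cache()
logs.filter($"level" === "ERROR").count()   // loads the file and fills the cache
logs.filter($"level" === "WARN").count()    // answered from the cache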

cache/persist happens in memory so the only benefit is that it won't read from the disk

Nope, cache and persist can use different storage levels, and memory is just the default level; check out https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
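
For reference, cache() is just persist() with the default level (MEMORY_ONLY for RDDs; Datasets/DataFrames default to MEMORY_AND_DISK), and other levels can spill to disk or store data serialized. A small sketch with hypothetical input paths:

import org.apache.spark.storage.StorageLevel

// Each RDD can be assigned exactly one storage level.
val a = spark.sparkContext.textFile("/data/a.txt")
a.cache()                                 // shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs

val b = spark.sparkContext.textFile("/data/b.txt")
b.persist(StorageLevel.MEMORY_AND_DISK)   // partitions that don't fit in memory spill to disk

val c = spark.sparkContext.textFile("/data/c.txt")
c.persist(StorageLevel.DISK_ONLY)         // keep the cached partitions on disk only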

Spark DataFrames are lazily evaluated, so if you do something like

val a = df1.join(df2)
val b = a.groupBy(col).agg(...)
a.write.parquet(...)
b.write.parquet(...)

then df1 and df2 will be scanned and joined twice, once for each write operation.

That is, unless you cache or persist them.
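
A hedged sketch of that fix for the snippet above, persisting a so the join is computed only once (the join key "id", the grouping column, and the output paths are invented for the example):

import org.apache.spark.sql.functions.count
import org.apache.spark.storage.StorageLevel

val a = df1.join(df2, "id")
a.persist(StorageLevel.MEMORY_AND_DISK)   // or a.cache()

val b = a.groupBy("someCol").agg(count("*"))

a.write.parquet("/out/a")                 // computes the join and fills the cache
b.write.parquet("/out/b")                 // reuses the cached join result

a.unpersist()                             // release the cached data when done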

Persistence for shuffle is a different thing altogether, and deals with the internals of the shuffle operation - not something that impacts what you see at the application layer.
