Spark cache/persist vs shuffle files
I am trying to understand the benefit of Spark's caching/persistence mechanism.
AFAIK, Spark always persists the RDDs after a shuffle as an optimization. So what benefit does a cache/persist call provide? I am assuming that cache/persist happens in memory, so the only benefit is that it won't read from disk. Is that correct?
Two differences come to mind.
what benefit does cache/persist call provide?
One example that comes to mind at once is the Spark reading process: if you read some data from the file system and run two separate chains of transformations (each ending in an action) on it, you will load the source data twice (check the Spark UI and you will see two loads). But if you cache it, the load only happens once.
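A minimal sketch of that point (using `spark.range` as a self-contained stand-in for a file read, so the names and path-free setup here are illustrative, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the "two actions, two loads" point: without cache() every
// action recomputes the lineage back to the source; with cache() the
// first action materializes the data and later actions reuse it.
object TwoActionsExample {
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("two-actions")
      .getOrCreate()
    import spark.implicits._

    // spark.range stands in for a file read so the sketch is runnable;
    // with a real spark.read.parquet(...) you would see two scans in the UI
    // without the cache() call below.
    val df = spark.range(100).toDF("id")

    df.cache() // mark for caching; materialized on the first action

    val evens = df.filter($"id" % 2 === 0).count() // triggers the (single) load
    val odds  = df.filter($"id" % 2 === 1).count() // served from the cache

    spark.stop()
    (evens, odds)
  }
}
```

With a real file source, the Spark UI would show one scan instead of two.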
cache/persist happens in memory so the only benefit is that it won't read from the disk
Nope, cache and persist can happen at different storage levels, and memory is only the default level. Check out: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
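Concretely, `cache()` is just `persist()` with the default storage level (MEMORY_ONLY for RDDs), while `persist(level)` lets you choose memory, disk, serialized, or replicated variants. A short sketch of the difference:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// cache() uses the default storage level; persist(level) lets you pick one.
object StorageLevelExample {
  def run(): (StorageLevel, StorageLevel) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("storage-levels")
      .getOrCreate()
    val sc = spark.sparkContext

    val inMemory = sc.parallelize(1 to 10)
    inMemory.cache() // RDD default: MEMORY_ONLY

    val spillable = sc.parallelize(1 to 10)
    spillable.persist(StorageLevel.MEMORY_AND_DISK) // spills to disk when memory is short

    val levels = (inMemory.getStorageLevel, spillable.getStorageLevel)
    spark.stop()
    levels
  }
}
```

Note that a storage level cannot be changed once set on an RDD, which is why the sketch uses two separate RDDs.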
Spark DataFrames are lazily evaluated, and if you do something like
val a = df1.join(df2)
val b = a.groupBy(col).agg(...)
a.write.parquet(...)
b.write.parquet(...)
then df1 and df2 will be scanned and joined twice, once in each write operation.
That is, unless you cache or persist them.
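Caching the intermediate join makes it compute once and be reused by both actions. A hedged sketch (the range-based `df1`/`df2` and the `count()` actions are stand-ins so the example is self-contained; the original used file reads and `write.parquet`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

// Caching the intermediate join: `a` is computed on the first action and
// reused by the second, instead of re-scanning and re-joining df1/df2.
object CachedJoinExample {
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cached-join")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for df1/df2; in the real case they would be file reads.
    val df1 = spark.range(10).toDF("id")
    val df2 = spark.range(10).toDF("id")

    val a = df1.join(df2, "id").cache() // join happens once
    val b = a.groupBy(($"id" % 2).as("parity")).agg(count("*").as("n"))

    val rowsA = a.count() // materializes the cache
    val rowsB = b.count() // reuses cached `a`, no second scan-and-join

    a.unpersist() // release the cached blocks when done
    spark.stop()
    (rowsA, rowsB)
  }
}
```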
Persistence for shuffle is a different thing altogether: it deals with the internals of the shuffle operation and is not something that impacts what you see at the application layer.