How to cache data in Apache Spark so that other Spark jobs can use it

Question

I have simple Spark code in which I read a file using SparkContext.textFile() and then perform some operations on that data. I am using spark-jobserver to get the output. In the code I cache the data, but after the job ends and I run the same Spark job again, it does not use the file that is already in the cache. As a result, the file is loaded on every run, which takes more time.

Sample code:

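import org.apache.spark.SparkContext

// Create a local SparkContext with the application name "test"
val sc = new SparkContext("local", "test")
// Read the file as an RDD of lines and mark it for caching
val data = sc.textFile("path/to/file.txt").cache()
val lines = data.count() // first action: reads the file and counts its lines
println(lines)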

Here, since I am reading the same file, the second execution should take the data from the cache, but it does not.

Is there any way to share the cached data among multiple Spark jobs?

Answer 1

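Yes: call persist/cache on the RDD you get, and submit the additional jobs on the same context. Cached RDDs belong to the SparkContext that created them, so a job that starts a new context starts with an empty cache.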
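As a rough illustration of "same context", here is a minimal sketch that runs two actions against one cached RDD inside a single SparkContext. The object name SharedCacheSketch, the helper countTwice, and the app name are made up for this example, and the file path is the placeholder from the question. It assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; the names are illustrative, not from the question.
object SharedCacheSketch {

  // Runs two actions on one cached RDD inside the given SparkContext.
  def countTwice(sc: SparkContext, path: String): (Long, Long) = {
    // cache() only marks the RDD for caching; nothing is read yet
    val data = sc.textFile(path).cache()
    // First action: reads the file and stores the partitions in the cache
    val first = data.count()
    // Second action on the same RDD and context: can reuse the cached
    // partitions, provided they still fit in memory
    val second = data.count()
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("shared-cache-sketch")
    val sc = new SparkContext(conf)
    try {
      val (first, second) = countTwice(sc, "path/to/file.txt")
      println(s"first=$first second=$second")
    } finally {
      // Stopping the context discards everything it had cached
      sc.stop()
    }
  }
}
```

With spark-jobserver, the equivalent would presumably be to create one long-lived context and submit every job to it rather than letting each job start its own. How the cached RDD is then looked up from a later job depends on the jobserver version, so check its documentation for the exact mechanism.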

Answered: 2015-07-27 10:15:38
