How to cache data in Apache Spark so that it can be used by other Spark jobs
I have a simple Spark job in which I read a file using SparkContext.textFile(), then do some operations on that data, and I am using spark-jobserver to get the output. In the code I cache the data, but after the job ends, when I execute the same Spark job again, it does not pick up the file that is already in the cache. So the file gets loaded every time, which takes more time.
Sample code:
val sc=new SparkContext("local","test")
val data=sc.textFile("path/to/file.txt").cache()
val lines=data.count()
println(lines)
Here, since I am reading the same file, the second execution should take the data from the cache, but it does not.
Is there any way I can share the cached data among multiple Spark jobs?
Yes - by calling persist/cache on the RDD, you can reuse it from other jobs submitted to the same context. A cached RDD only lives as long as the SparkContext that created it, so if each run builds its own context (as `new SparkContext("local", "test")` does here), the cache is discarded when the job ends. With spark-jobserver, create a long-lived context and submit all jobs against it.
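A minimal sketch of how this can look with spark-jobserver's named-RDD support, assuming the `NamedRddSupport` trait and its `namedRdds.getOrElseCreate` helper as documented in the spark-jobserver project; the RDD name "lines-rdd" and the config key "input.path" are illustrative choices, not part of any API:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// Sketch: a job-server job that builds and caches the file's RDD on the first
// run, then looks it up by name on later runs in the same context.
object CountLinesJob extends SparkJob with NamedRddSupport {

  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // "lines-rdd" and "input.path" are illustrative assumptions.
    // getOrElseCreate returns the cached RDD if it already exists in this
    // context; otherwise it evaluates the second argument and registers it.
    val data: RDD[String] = namedRdds.getOrElseCreate("lines-rdd",
      sc.textFile(config.getString("input.path")).cache())
    data.count()
  }
}
```

For this to work, both runs must be submitted to the same long-lived context (created up front, e.g. via spark-jobserver's `POST /contexts/<name>` REST endpoint, and passed as the `context` parameter when submitting the job), since named RDDs only survive as long as the SparkContext that owns them.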