Kafka Spark Streaming cache
I have been getting data from one Kafka topic in the form of a JavaPairInputDStream (via the Twitter streaming API). The plan is to get data from two topics, check for duplicates by tweet_id, and add a tweet to the package (the package is for sending back to Kafka) if it isn't already there. I also want to cache the data for x minutes and then work on it.
I can get data from the Kafka topic and output it with:

stream.foreachRDD(rdd -> {
    System.out.println("--- New RDD with " + rdd.partitions().size()
            + " partitions and " + rdd.count() + " records");
    rdd.foreach(record -> System.out.println(record._2));
});
But I can't manage to cache it. I tried rdd.cache() and persisting followed by count(), but it doesn't seem to do the trick, or I just wasn't able to understand it. Can anyone guide me on how to do this?
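One standard way to keep the last x minutes of stream data available for processing is a windowed DStream rather than caching the per-batch RDDs by hand. The sketch below assumes the old receiver-based spark-streaming-kafka 0.8 API that matches JavaPairInputDStream; the ZooKeeper address, group id, topic name, and durations are placeholders, not values from the question:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class WindowedTweets {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("WindowedTweets");
        // 10-second batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // Windowed operations require a checkpoint directory
        jssc.checkpoint("/tmp/spark-checkpoint");

        Map<String, Integer> topics = new HashMap<>();
        topics.put("tweets", 1); // hypothetical topic name

        JavaPairReceiverInputDStream<String, String> stream =
                KafkaUtils.createStream(jssc, "localhost:2181", "tweet-group", topics);

        // Keep the last 5 minutes of records, re-evaluated every 10 seconds.
        // Spark persists the underlying data of windowed streams automatically.
        JavaPairDStream<String, String> lastFiveMinutes =
                stream.window(Durations.minutes(5), Durations.seconds(10));

        lastFiveMinutes.foreachRDD(rdd ->
                System.out.println("Records in window: " + rdd.count()));

        jssc.start();
        jssc.awaitTermination();
    }
}
```

Each batch then sees the full 5-minute window, so deduplication by tweet_id can run over that window instead of a single micro-batch.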
Update: it seems impossible to cache the per-batch RDDs like this. Instead, I created another RDD and call union() on it whenever the stream produces a new RDD, and cache it that way.
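The union() approach described in the update can be sketched as follows. This is a fragment, not a full program: it assumes the stream variable from the question's setup and a JavaStreamingContext named jssc; the historyRdd variable, the use of reduceByKey for deduplication, and the unpersist() call to release the previous cached copy are illustrative choices, not details from the question:

```java
// Driver-side holder for the accumulated RDD; starts empty.
// An array is used so the lambda below can reassign it.
final JavaPairRDD<String, String>[] historyRdd = new JavaPairRDD[] {
        JavaPairRDD.fromJavaRDD(
                jssc.sparkContext().<Tuple2<String, String>>emptyRDD())
};

stream.foreachRDD(rdd -> {
    // Merge the new batch into the accumulated data and
    // drop duplicate tweet_ids, keeping the first occurrence.
    JavaPairRDD<String, String> merged = historyRdd[0]
            .union(rdd)
            .reduceByKey((a, b) -> a);

    merged.cache();
    merged.count(); // force materialization so the lineage isn't recomputed later

    historyRdd[0].unpersist(); // release the previously cached copy
    historyRdd[0] = merged;
});
```

Note that without trimming, historyRdd grows without bound; combining this with a window() or a timestamp filter would be needed to keep only the last x minutes.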