
Kafka Spark Streaming cache

I have been getting data from one Kafka topic in the form of a JavaPairInputDStream (with the Twitter streaming API). The plan is to get data from two topics, check for duplicates by tweet_id, and if the tweet is not already in the package (the package is for sending back to Kafka), add it. I also want to cache the data for x minutes and then work on it.
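
In case it helps to see that plan in code, here is a minimal sketch. It assumes streamA and streamB are the two topic streams, the record values are raw tweet JSON, and the regex is only an illustrative stand-in for proper JSON parsing of the tweet id (an id_str field is assumed):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import scala.Tuple2;

Pattern idPattern = Pattern.compile("\"id_str\"\\s*:\\s*\"(\\d+)\"");
Set<String> packageIds = ConcurrentHashMap.newKeySet(); // tweet_ids already in the package

// Union the two topic streams, then keep only tweets whose id has not been seen yet.
streamA.union(streamB).foreachRDD(rdd -> {
    for (Tuple2<String, String> record : rdd.collect()) {   // collects the batch to the driver
        Matcher m = idPattern.matcher(record._2);
        if (m.find() && packageIds.add(m.group(1))) {
            // first occurrence of this tweet_id: add the tweet to the package for Kafka
        }
    }
});

Note that collect() pulls every batch to the driver, which is fine for small volumes but does not scale; a distributed de-duplication (for example mapToPair on tweet_id followed by reduceByKey) would avoid that.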

I can get data from the Kafka topic and output it with:

stream.foreachRDD(rdd -> {
    System.out.println("--- New RDD with " + rdd.partitions().size()
        + " partitions and " + rdd.count() + " records");
    rdd.foreach(record -> System.out.println(record._2));
});

But I can't manage to cache it. I tried rdd.cache() and persist() followed by count(), but it doesn't seem to do the trick, or I just wasn't able to understand it.
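
For reference, a minimal sketch of what persisting the stream looks like, assuming stream is the JavaPairInputDStream from above. Each micro-batch RDD still only lives for one batch interval, so cache()/persist() alone does not accumulate data across batches:

import org.apache.spark.api.java.StorageLevels;

// Persist every micro-batch RDD produced by the stream.
stream.persist(StorageLevels.MEMORY_AND_DISK);

stream.foreachRDD(rdd -> {
    // count() is an action, so it forces the batch to be computed and cached.
    long records = rdd.count();
    System.out.println("--- Batch cached with " + records + " records");
});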

Can anyone guide me on how to do this?

Okay, so it seems it's impossible to cache the RDD like this. I created another RDD, and whenever the stream produces a new RDD I union() it in and cache the result.
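
A minimal sketch of that union() approach, assuming jssc is the JavaStreamingContext and accumulated is just an illustrative name for the RDD that collects all batches (a one-element array is used only so the lambda can update it):

import java.util.Collections;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Start with an empty pair RDD and grow it with each micro-batch.
JavaPairRDD<String, String>[] accumulated = new JavaPairRDD[] {
    JavaPairRDD.fromJavaRDD(
        jssc.sparkContext().parallelize(Collections.<Tuple2<String, String>>emptyList()))
};

stream.foreachRDD(rdd -> {
    // Merge the new batch into the accumulated RDD and re-cache the result.
    accumulated[0] = accumulated[0].union(rdd).cache();
    System.out.println("Cached so far: " + accumulated[0].count() + " records");
});

Keep in mind that the lineage grows with every batch, so for a long-running job the accumulated RDD usually needs periodic checkpointing.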
