
Kafka Spark Streaming cache

I have been getting data from one Kafka topic in the form of a JavaPairInputDStream (with the Twitter streaming API). The plan is to get data from two topics, check for duplicates by tweet_id, and if the tweet is not already in the package (the package is for sending back to Kafka), add it. I also want to cache the data for x minutes and then work on it.
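
In case it helps to see that plan in code, here is a minimal sketch. It assumes streamA and streamB are the two topic streams, the record values are raw tweet JSON, and the regex is only an illustrative stand-in for proper JSON parsing of the tweet id (an id_str field is assumed):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import scala.Tuple2;

Pattern idPattern = Pattern.compile("\"id_str\"\\s*:\\s*\"(\\d+)\"");
Set<String> packageIds = ConcurrentHashMap.newKeySet(); // tweet_ids already in the package

// Union the two topic streams, then keep only tweets whose id has not been seen yet.
streamA.union(streamB).foreachRDD(rdd -> {
    for (Tuple2<String, String> record : rdd.collect()) {   // collects the batch to the driver
        Matcher m = idPattern.matcher(record._2);
        if (m.find() && packageIds.add(m.group(1))) {
            // first occurrence of this tweet_id: add the tweet to the package for Kafka
        }
    }
});

Note that collect() pulls every batch to the driver, which is fine for small volumes but does not scale; a distributed de-duplication (for example mapToPair on tweet_id followed by reduceByKey) would avoid that.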

I can get data from the Kafka topic and output it with:

stream.foreachRDD(rdd -> {
    System.out.println("--- New RDD with " + rdd.partitions().size()
        + " partitions and " + rdd.count() + " records");
    rdd.foreach(record -> System.out.println(record._2));
});

But I can't manage to cache it. I tried rdd.cache() and persist() followed by count(), but it doesn't seem to do the trick, or I just wasn't able to understand it.
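
For reference, a minimal sketch of what persisting the stream looks like, assuming stream is the JavaPairInputDStream from above. Each micro-batch RDD still only lives for one batch interval, so cache()/persist() alone does not accumulate data across batches:

import org.apache.spark.api.java.StorageLevels;

// Persist every micro-batch RDD produced by the stream.
stream.persist(StorageLevels.MEMORY_AND_DISK);

stream.foreachRDD(rdd -> {
    // count() is an action, so it forces the batch to be computed and cached.
    long records = rdd.count();
    System.out.println("--- Batch cached with " + records + " records");
});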

Can anyone guide me on how to do this?

Okay, so it seems it's impossible to cache the RDD like this. I created another RDD, and whenever the stream produces a new RDD I union() it in and cache the result.
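
A minimal sketch of that union() approach, assuming jssc is the JavaStreamingContext and accumulated is just an illustrative name for the RDD that collects all batches (a one-element array is used only so the lambda can update it):

import java.util.Collections;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Start with an empty pair RDD and grow it with each micro-batch.
JavaPairRDD<String, String>[] accumulated = new JavaPairRDD[] {
    JavaPairRDD.fromJavaRDD(
        jssc.sparkContext().parallelize(Collections.<Tuple2<String, String>>emptyList()))
};

stream.foreachRDD(rdd -> {
    // Merge the new batch into the accumulated RDD and re-cache the result.
    accumulated[0] = accumulated[0].union(rdd).cache();
    System.out.println("Cached so far: " + accumulated[0].count() + " records");
});

Keep in mind that the lineage grows with every batch, so for a long-running job the accumulated RDD usually needs periodic checkpointing.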
