
Stream the most recent data in cassandra with spark streaming

I continuously have data being written to cassandra from an outside source.

Now, I am using spark streaming to continuously read this data from cassandra with the following code:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import com.datastax.spark.connector.streaming._

// 5-second micro-batches
val ssc = new StreamingContext(sc, Seconds(5))

// Full-table RDD from the spark-cassandra-connector
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")

// Re-emits the same RDD on every batch interval
val dstream = new ConstantInputDStream(ssc, cassandraRDD)

dstream.foreachRDD { rdd =>
  println("\n" + rdd.count())
}

ssc.start()
ssc.awaitTermination()
sc.stop()

However, the following line:

val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")

takes the entire table data from cassandra every time, not just the newest data saved into the table.

What I want to do is have spark streaming read only the latest data, i.e., the data added after its previous read.

How can I achieve this? I tried to Google this but found very little documentation on it.
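For reference, here is a rough sketch of the kind of incremental read I have in mind. It assumes a hypothetical timestamp clustering column inserted_at on the feeds table (not part of my actual schema above), uses the connector's .where predicate pushdown, and keeps a high-water mark in a driver-side variable:

import java.util.Date
import org.apache.spark.streaming.dstream.ConstantInputDStream
import com.datastax.spark.connector._

// Hypothetical high-water mark: the newest timestamp seen so far.
var lastRead = new Date(0L)

// Dummy stream that just fires once per batch interval.
val ticks = new ConstantInputDStream(ssc, ssc.sparkContext.emptyRDD[Int])

ticks.foreachRDD { _ =>
  // Build a fresh RDD each batch so the filter uses the current lastRead;
  // the connector pushes the predicate down to Cassandra.
  val newRows = ssc.sparkContext
    .cassandraTable("keyspace2", "feeds")
    .where("inserted_at > ?", lastRead)

  if (!newRows.isEmpty()) {
    println("\n" + newRows.count())
    // Advance the high-water mark to the newest row seen in this batch.
    lastRead = new Date(newRows.map(_.getDate("inserted_at").getTime).max())
  }
}

But this still issues a query against Cassandra on every batch, and the mark is lost if the driver restarts, so I am not sure it is the right approach.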

I am using spark 1.4.1, scala 2.10.4 and cassandra 2.1.12.

Thanks!

EDIT:

The suggested duplicate question (asked by me) is NOT a duplicate, because it is about connecting spark streaming and cassandra, while this question is about streaming only the latest data. BTW, streaming from cassandra IS possible using the code I provided. However, it takes the entire table every time and not just the latest data.

There is some low-level work underway in Cassandra that will allow external systems (indexers, spark streaming, etc.) to be notified of new mutations incoming to Cassandra. Read: https://issues.apache.org/jira/browse/CASSANDRA-8844
