
Scala Spark Filter RDD using Cassandra

I am new to spark-Cassandra and Scala. I have an existing RDD, let's say:

((url_hash, url, created_timestamp))

I want to filter this RDD based on url_hash. If a url_hash already exists in the Cassandra table, I want to filter it out of the RDD so that I only process the new URLs.

The Cassandra table looks like the following:

 url_hash| url | created_timestamp | updated_timestamp

Any pointers would be great.

I tried something like this:

   import java.util.Date
   import com.datastax.spark.connector._

   case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
   def timestamp = new Date()
   // Key each incoming URL by its SHA-256 hash
   val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
   // Load the rows already stored in Cassandra
   val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
   val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
   // Keep only the URLs whose hash is not yet in Cassandra
   val newUrlsRDD = rdd1.subtractByKey(rdd3)
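
The snippet above calls calcSHA256 without showing it. For anyone trying to reproduce this, here is a minimal sketch of what such a helper might look like. It assumes the hash is the lowercase hex SHA-256 of the URL's UTF-8 bytes; the real helper is not shown in the post, so the object name UrlHashing and the encoding are assumptions.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object UrlHashing {
  // Hypothetical stand-in for the calcSHA256 helper used in the post:
  // hashes the UTF-8 bytes of the URL and returns lowercase hex.
  def calcSHA256(url: String): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    val bytes = digest.digest(url.getBytes(StandardCharsets.UTF_8))
    // Mask each byte to an unsigned value before formatting as two hex digits
    bytes.map(b => f"${b & 0xff}%02x").mkString
  }
}
```

The result is always 64 hex characters, so it fits a text column such as url_sha256.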

I am getting a Cassandra error:

java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info.If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper

There are no null values in the Cassandra table.
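
The error message suggests wrapping the column type in Option. Here is a minimal plain-Scala sketch of what that means, with no Spark or Cassandra involved. The helper names fromRaw and urlOrDefault are hypothetical and only illustrate how a possibly-null value becomes None instead of raising an exception.

```scala
import java.util.Date

object NullableColumns {
  // Same fields as the case class in the post, with the nullable columns wrapped in Option
  case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])

  // Hypothetical helper: builds a row from raw values that may be null.
  // Option(null) evaluates to None, so no NullPointerException is thrown here.
  def fromRaw(hash: String, url: String, ts: Date): UrlInfoT =
    UrlInfoT(hash, Option(url), Option(ts))

  // Reads the URL safely, falling back to a placeholder when the column is empty
  def urlOrDefault(row: UrlInfoT, default: String = "<missing>"): String =
    row.full_url.getOrElse(default)
}
```

For example, NullableColumns.fromRaw("h", null, null).full_url is None, and urlOrDefault on that row returns the placeholder.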

Thanks, The Archetypal Paul!

I hope somebody finds this useful. I had to add Option to the case class.

Looking forward to better solutions.

   // full_url and created_ts are wrapped in Option so that null column values map to None
   case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])

   def timestamp = new java.util.Date()
   val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
   val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
   val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
   // Keep only the URLs whose hash is not yet in Cassandra
   val newUrlsRDD = rdd1.subtractByKey(rdd3)
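
To check the filtering logic without a cluster, the effect of subtractByKey can be imitated on plain Scala collections. This is only a sketch of the semantics, not the Spark implementation, and the object name NewUrlFilter and the sample data are made up for illustration.

```scala
object NewUrlFilter {
  // Keeps only the entries of `incoming` whose key is absent from `existingKeys`,
  // which is the effect subtractByKey has on a pair RDD.
  def subtractByKey[K, V](incoming: Seq[(K, V)], existingKeys: Set[K]): Seq[(K, V)] =
    incoming.filterNot { case (key, _) => existingKeys.contains(key) }
}

object NewUrlFilterDemo {
  // Hypothetical sample data: two incoming URLs, one hash already stored
  val incoming = Seq("h1" -> "http://a.example", "h2" -> "http://b.example")
  val known = Set("h1")
  val fresh = NewUrlFilter.subtractByKey(incoming, known) // only the "h2" entry remains
}
```

Only the keys from the Cassandra side take part in the comparison. Since Cassandra primary key columns cannot hold null, reading just url_sha256 from the table might be enough for this filter and would sidestep the nullable columns, though I have not tested that variant.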
