
Query cassandra from spark executor

I have a streaming app reading off of kafka, and I was wondering if there is a way to do a range query from inside a map function?

I group the messages from kafka by time range and key, and then based on those time ranges and keys I want to pull data from cassandra into that dstream.

Something like:

lookups
  .map(lookup => ((lookup.key, lookup.startTime, lookup.endTime), lookup))
  .groupByKey()
  .transform(rdd => {
    val cassandraSQLContext = new CassandraSQLContext(rdd.context)
    rdd.map(lookupPair => {
      val tableName = //variable based on lookup
      val startTime = lookupPair._1._2
      val endTime = lookupPair._1._3

      cassandraSQLContext
        .cassandraSql(s"SELECT * FROM ${CASSANDRA_KEYSPACE}.${tableName} WHERE key=${...} AND start_time >= ${startTime} AND start_time < ${endTime};")
        .map(row => row match {
          case /*case 1*/ => new object1(row)
          case /*case 2*/ => new object2(row)
        })
        .collect()
    })
  })

This gives me a null pointer exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 59.0 failed 1 times, most recent failure: Lost task 0.0 in stage 59.0 (TID 63, localhost): java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:231)
at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:70)
at RollupFineGrainIngestionService$$anonfun$11$$anonfun$apply$2.apply(MyFile.scala:130)
at RollupFineGrainIngestionService$$anonfun$11$$anonfun$apply$2.apply(MyFile.scala:123)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)

I've also tried ssc.cassandraTable(CASSANDRA_KEYSPACE, tableName).where("key = ?", ...), but spark crashes when trying to access the StreamingContext from inside a map.

If anyone has any suggestions, I would appreciate it. Thanks!

You may want to use joinWithCassandraTable if your query is based off of a partition key.
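A minimal sketch of that approach, joining each batch's RDD against the Cassandra table by partition key. The keyspace/table names and the `Lookup` case class are assumptions for illustration, and this assumes `key` is the table's single partition key column:

```scala
import com.datastax.spark.connector._

// Hypothetical message type; adapt to your actual schema.
case class Lookup(key: String, startTime: Long, endTime: Long)

val joined = lookups.transform { rdd =>
  rdd
    .map(lookup => Tuple1(lookup.key))                 // project down to the partition key
    .joinWithCassandraTable("my_keyspace", "my_table") // returns (Tuple1(key), CassandraRow) pairs
}
```

Note that `joinWithCassandraTable` pulls whole partitions per key; the range filtering on `start_time` would then happen in Spark, unless the clustering columns allow a narrower join.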

But if you need more flexibility:

CassandraConnector(sc.getConf).withSessionDo( session => ...)

This will give you access to the session pool on the executor, so you can execute whatever you want without managing connections yourself. The code is all serializable and can be placed within maps.
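A hedged sketch of that pattern applied to the question's grouped stream: prepare one statement per partition and run the range query per group on the executor. The keyspace, table, and column names are assumptions; the results are materialized eagerly so the session is not used after `withSessionDo` returns:

```scala
import com.datastax.spark.connector.cql.CassandraConnector

// Built on the driver; CassandraConnector is serializable and ships to executors.
val connector = CassandraConnector(ssc.sparkContext.getConf)

val enriched = grouped.transform { rdd =>
  rdd.mapPartitions { iter =>
    connector.withSessionDo { session =>
      val stmt = session.prepare(
        "SELECT * FROM my_keyspace.my_table " +
        "WHERE key = ? AND start_time >= ? AND start_time < ?")
      iter.map { case ((key, start, end), lookupsForKey) =>
        // Bind and execute the range query for this group.
        val rows = session.execute(stmt.bind(key, Long.box(start), Long.box(end))).all()
        (key, rows)
      }.toVector // force evaluation while the session is still checked out
    }.iterator
  }
}
```

This avoids the NullPointerException because no SQLContext or StreamingContext is referenced inside the closure, only the serializable connector.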
