
How to parse data from 2 x Kinesis streams in 1 x Spark Streaming App?

I am trying to run the query

select a.user_id , b.domain from realTimeTable_1 as a join realTimeTable_2 as b on a.device_id = b.device_id

using two Kinesis streams. However, the output from stream2 is missing. Does anyone know how to join two streams, or write data from both simultaneously to HBase or Parquet? Here is my code. I set SparkConf().set("spark.streaming.concurrentJobs", "2") so that both streams are processed.
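For context, sc and sqlContext are created beforehand, roughly like this (a sketch; names other than the concurrentJobs setting are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sparkConf = new SparkConf()
  .setAppName("TwoKinesisStreams")            // arbitrary app name
  .set("spark.streaming.concurrentJobs", "2") // run both stream jobs concurrently
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)           // used by createDataFrame below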

val endpointUrl = "https://kinesis.us-east-1.amazonaws.com"
val kinesisClient = new AmazonKinesisClient(credentials) // credentials defined elsewhere
kinesisClient.setEndpoint(endpointUrl)

// One receiver (and therefore one DStream) per shard of each stream
val numShards_s1 = kinesisClient.describeStream("stream1").getStreamDescription().getShards().size
val numShards_s2 = kinesisClient.describeStream("stream2").getStreamDescription().getShards().size
val numStreams_s1 = numShards_s1
val numStreams_s2 = numShards_s2

val batchInterval = Seconds(5)
val kinesisCheckpointInterval = batchInterval
val regionName = RegionUtils.getRegionByEndpoint(endpointUrl).getName()
val ssc = new StreamingContext(sc, batchInterval)

val kinesisStreams_s1 = (0 until numStreams_s1).map { i =>
  KinesisUtils.createStream(ssc, "stream-demo", "stream1", endpointUrl, regionName,
    InitialPositionInStream.LATEST, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}
val kinesisStreams_s2 = (0 until numStreams_s2).map { i =>
  KinesisUtils.createStream(ssc, "stream-demo", "stream2", endpointUrl, regionName,
    InitialPositionInStream.LATEST, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}

// Merge the per-shard DStreams of each stream into one DStream apiece
val unionStreams_s1 = ssc.union(kinesisStreams_s1)
val unionStreams_s2 = ssc.union(kinesisStreams_s2)

val schemaString_s1 = "user_id,device_id,action,timestamp"
val schemaString_s2 = "device_id,domain,timestamp"
val tableSchema_s1 = StructType(schemaString_s1.split(",").map(fieldName => StructField(fieldName, StringType, true)))
val tableSchema_s2 = StructType(schemaString_s2.split(",").map(fieldName => StructField(fieldName, StringType, true)))

unionStreams_s1.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => {
  // Parse each CSV record and expose the batch as a temp view
  val rowRDD = rdd.map(w => Row.fromSeq(new String(w).split(",")))
  val output1 = sqlContext.createDataFrame(rowRDD, tableSchema_s1)
  output1.createOrReplaceTempView("realTimeTable_1")
})

unionStreams_s2.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => {
  val rowRDD = rdd.map(w => Row.fromSeq(new String(w).split(",")))
  val output2 = sqlContext.createDataFrame(rowRDD, tableSchema_s2)
  output2.createOrReplaceTempView("realTimeTable_2")
})

So then, in theory, I should be able to perform:

select a.user_id , b.domain from realTimeTable_1 as a join realTimeTable_2 as b on a.device_id = b.device_id
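Concretely, from the driver I would issue something like this (a sketch, assuming at least one batch has completed for both streams):

val joined = sqlContext.sql(
  "select a.user_id, b.domain from realTimeTable_1 as a " +
  "join realTimeTable_2 as b on a.device_id = b.device_id")
joined.show() // or write each batch out to HBase / Parquet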

However, even select * from realTimeTable_2 produces no output, so I think my code is missing something. Can anyone spot the missing logic, please?

At Splice Machine, we never tried dual streams, only a single stream that we then joined to persistent data via SQL.

I am not seeing the start of the stream anywhere in your code.
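Without starting the context, no receivers run, no batches execute, and the temp views are never populated. A minimal sketch of the missing tail, placed after the foreachRDD registrations:

ssc.start()            // start the receivers and batch scheduling
ssc.awaitTermination() // block the driver so batches keep running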

Here is code that seems very similar to yours; I hope it helps. Check out KinesisWordCountASL.scala on the master branch of Spark.

Here is a link for the short term.

https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
