merge spark dStream with variable to saveToCassandra()

I have a DStream[(String, Int)] with pairs of word counts, e.g. ("hello" -> 10). I want to write these counts to Cassandra with a step index. The index is initialized as var step = 1 and is incremented with each microbatch processed.

The Cassandra table is created as:

CREATE TABLE wordcounts (
    step int,
    word text,
    count int,
    primary key (step, word)
);
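
For example, after two microbatches the table should contain rows like the following (the second batch's values are made up for illustration):

 step | word  | count
------+-------+-------
    1 | hello |    10
    2 | hello |     7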

When trying to write the stream to the table...

stream.saveToCassandra("keyspace", "wordcounts", SomeColumns("word", "count"))

... I get java.lang.IllegalArgumentException: Some primary key columns are missing in RDD or have not been selected: step.

How can I prepend the step index to the stream in order to write the three columns together?

I'm using Spark 2.0.0, Scala 2.11.8, Cassandra 3.4.0, and spark-cassandra-connector 2.0.0-M3.

As noted, while the Cassandra table expects something of the form (Int, String, Int), the wordCount DStream is of type DStream[(String, Int)], so for the call to saveToCassandra(...) to work, we need a DStream of type DStream[(Int, String, Int)].

The tricky part in this question is how to bring a local counter, which is by definition only known in the driver, up to the level of the DStream.

To do that, we need to do two things: "lift" the counter to a distributed level (in Spark, an RDD or a DataFrame) and join that value with the existing DStream data.

Starting from the classic streaming word-count example:

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
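
(For context, lines is assumed here to come from the standard socket-based setup of the Spark Streaming examples; a minimal sketch:)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumed setup: 1-second microbatches and a socket text source,
// as in the classic Spark Streaming word count example
val conf = new SparkConf().setAppName("StepWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)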

We add a local var to hold the count of the microbatches:

@transient var batchCount = 0

It's declared @transient so that Spark doesn't try to serialize it with the closures of the transformations that use it.

Now the tricky bit: within the context of a DStream transformation, we make an RDD out of that single variable and join it with the underlying RDD of the DStream using a cartesian product:

val batchWordCounts = wordCounts.transform { rdd =>
  // transform runs on the driver once per microbatch, so this increment is safe
  batchCount = batchCount + 1

  // a one-element RDD holding the current batch number
  val localCount = rdd.sparkContext.parallelize(Seq(batchCount))
  rdd.cartesian(localCount).map { case ((word, count), batch) => (batch, word, count) }
}

(Note that a simple map function would not work, as only the initial value of the variable would be captured and serialized. Therefore, it would look like the counter never increased when looking at the DStream data.)

Finally, now that the data is in the right shape, save it to Cassandra:

batchWordCounts.saveToCassandra("keyspace", "wordcounts")
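
For this call to compile, the streaming implicits of the connector must be in scope, and the streaming context still has to be started; a minimal sketch (assuming the ssc from the setup above):

import com.datastax.spark.connector.streaming._  // adds saveToCassandra to DStreams

ssc.start()
ssc.awaitTermination()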

The updateStateByKey function is provided by Spark for global state handling. For this case it could look something like the following:

// increment the per-key step by one for every microbatch in which the
// key appears; newValues is deliberately ignored
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount: Int = runningCount.getOrElse(0) + 1
    Some(newCount)
}
val step = stream.updateStateByKey(updateFunction _)

stream.join(step).map{ case (key, (count, step)) => (step, key, count) }
   .saveToCassandra("keyspace", "wordcounts")
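
Note that updateStateByKey requires checkpointing to be enabled, otherwise the job fails at startup; a one-line sketch (the directory is an arbitrary example):

// required for updateStateByKey; any reliable path (e.g. on HDFS) works
ssc.checkpoint("/tmp/streaming-checkpoint")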

Since you are trying to save the RDD to an existing Cassandra table, you need to include all the primary key column values in the RDD.

Alternatively, you can use one of the following methods to save the RDD to a new table:

saveAsCassandraTable or saveAsCassandraTableEx

For more information, see the spark-cassandra-connector documentation.
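
A minimal sketch of the first method (the case class and table name are hypothetical; saveAsCassandraTable infers the new table's schema from the RDD's element type):

import com.datastax.spark.connector._

// a case class gives the new table its column names and types
case class WordCountRow(step: Int, word: String, count: Int)

val rows = sc.parallelize(Seq(WordCountRow(1, "hello", 10), WordCountRow(1, "world", 4)))

// creates keyspace.wordcounts_flat from the case class fields, then writes the rows
rows.saveAsCassandraTable("keyspace", "wordcounts_flat")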
