I want to write Spark Structured Streaming data into Cassandra. My Spark version is 2.4.0.
I researched some posts, and some of them use the DataStax Enterprise platform. I haven't used it, but I found the foreachBatch method, which helps write streaming data to a sink.
I reviewed the docs on the Databricks site and tried it on my own.
This is the code I've written:
parsed = parsed_opc \
    .withWatermark("sourceTimeStamp", "10 minutes") \
    .dropDuplicates(["id", "sourceTimeStamp"]) \
    .groupBy(
        window(parsed_opc.sourceTimeStamp, "4 seconds"),
        parsed_opc.id
    ) \
    .agg({"value": "avg"}) \
    .withColumnRenamed("avg(value)", "avg") \
    .withColumnRenamed("window", "sourceTime")

def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .options(table="opc", keyspace="poc") \
        .save()

parsed.writeStream \
    .foreachBatch(writeToCassandra) \
    .outputMode("update") \
    .start()
The schema of the parsed dataframe is:
root
|-- sourceTime: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- id: string (nullable = true)
|-- avg: double (nullable = true)
I can successfully write this streaming df to the console like this:
query = parsed \
    .writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()
And the output in the console is as follows:
+--------------------+----+---+
| sourceTime| id|avg|
+--------------------+----+---+
|[2019-07-20 18:55...|Temp|2.0|
+--------------------+----+---+
So writing to the console is OK, but when I query in cqlsh, no record is appended to the table.
This is the table creation script in Cassandra:
CREATE TABLE poc.opc ( id text, avg float,sourceTime timestamp PRIMARY KEY );
So, can you tell me what is wrong?
After working on the subject, I found the solution.
Looking closely at the terminal logs, I noticed this error: com.datastax.spark.connector.types.TypeConversionException: Cannot convert object [2019-07-20 18:55:00.0,2019-07-20 18:55:04.0] of type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema to java.util.Date.
This happens because, when doing a window operation in Spark, the timestamp column is replaced by a struct, which in this case is sourceTime. The schema of sourceTime looks like this:
sourceTime: struct (nullable = false)
 |-- start: timestamp (nullable = true)
 |-- end: timestamp (nullable = true)
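Those two struct fields are just the boundaries of the 4-second tumbling window that each row falls into. A minimal pure-Python sketch of the same bucketing (the function name is mine, for illustration; it mirrors Spark's window(col, "4 seconds") semantics):

```python
from datetime import datetime, timedelta

def tumbling_window(ts, width_s=4):
    """Compute the [start, end) window that a timestamp falls into,
    mirroring Spark's window(col, "4 seconds") bucketing."""
    epoch = datetime(1970, 1, 1)
    offset = (ts - epoch).total_seconds()
    # Floor the offset to the nearest window boundary.
    start = epoch + timedelta(seconds=(offset // width_s) * width_s)
    return start, start + timedelta(seconds=width_s)

print(tumbling_window(datetime(2019, 7, 20, 18, 55, 2)))
# -> (datetime(2019, 7, 20, 18, 55), datetime(2019, 7, 20, 18, 55, 4))
```

This matches the window [2019-07-20 18:55:00.0, 2019-07-20 18:55:04.0] seen in the error log: the connector received the whole (start, end) pair where Cassandra expected a single timestamp.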
But the sourceTime column I created in Cassandra expects only a single timestamp value. Looking at the error, the connector tries to send the start and end timestamps, which do not exist as columns in the Cassandra table.
So, flattening the struct by selecting its fields from the parsed dataframe (aliasing start to match the Cassandra column name, since the table has no start or end column) solved the problem:
cassandra_df = parsed.select(parsed.sourceTime.start.alias("sourceTime"), "avg", "id")