How to Write Structured Streaming Data into Cassandra with PySpark?
I want to write Spark structured streaming data into Cassandra. My Spark version is 2.4.0.
I've researched some posts, and some of them use the DataStax Enterprise platform. I didn't use it; instead I found a method, foreachBatch, which helps write streaming data to a sink.
I reviewed a document on the Databricks site and tried it myself.
This is the code I've written:
parsed = parsed_opc \
    .withWatermark("sourceTimeStamp", "10 minutes") \
    .dropDuplicates(["id", "sourceTimeStamp"]) \
    .groupBy(
        window(parsed_opc.sourceTimeStamp, "4 seconds"),
        parsed_opc.id
    ) \
    .agg({"value": "avg"}) \
    .withColumnRenamed("avg(value)", "avg") \
    .withColumnRenamed("window", "sourceTime")
def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .options(table="opc", keyspace="poc") \
        .save()
parsed.writeStream \
    .foreachBatch(writeToCassandra) \
    .outputMode("update") \
    .start()
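As an aside, this write path requires the Spark Cassandra Connector on the classpath. A typical submit command for Spark 2.4 might look like the following; the connector coordinates and host shown here are assumptions, so verify the version against your Scala/Spark build:

```shell
# Assumed connector version for Spark 2.4 / Scala 2.11; check the
# spark-cassandra-connector releases for the one matching your cluster.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  your_streaming_job.py
```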
The schema of the parsed dataframe is:
root
|-- sourceTime: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- id: string (nullable = true)
|-- avg: double (nullable = true)
I can successfully write this streaming df to the console like this:
query = parsed \
    .writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()
And the output in the console is as follows:
+--------------------+----+---+
| sourceTime| id|avg|
+--------------------+----+---+
|[2019-07-20 18:55...|Temp|2.0|
+--------------------+----+---+
So, writing to the console is OK. But when I query in cqlsh, there are no records appended to the table.
This is the table creation script in Cassandra:
CREATE TABLE poc.opc ( id text, avg float, sourceTime timestamp PRIMARY KEY );
So, can you tell me what is wrong?
After working on the subject, I found the solution.
Looking closely at the terminal logs, I figured out that there is an error log:
com.datastax.spark.connector.types.TypeConversionException: Cannot convert object [2019-07-20 18:55:00.0,2019-07-20 18:55:04.0] of type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema to java.util.Date.
It is because, when doing a window operation in Spark, it adds a struct to the schema of the timestamp column, which in this case is sourceTime. The schema of sourceTime looks like this:
sourceTime: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
But I had created a column in Cassandra named sourceTime that expects only a single timestamp value. Looking at the error, Spark tries to send the start and end timestamp fields, which do not exist as columns in the Cassandra table.
So, selecting these columns from the parsed dataframe solved the problem:
cassandra_df = parsed.select("sourcetime.start", "avg", "sourcetime.end", "id")
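To see why flattening fixes the type error, here is a plain-Python sketch (no Spark required; the dicts are illustrative stand-ins for the Row the Cassandra connector receives, not real Spark objects) of turning the struct-valued row into scalar columns:

```python
from datetime import datetime

# A row shaped like the output of the windowed aggregation above:
# sourceTime is a struct with start/end, which the connector cannot
# coerce into a single timestamp column.
windowed_row = {
    "sourceTime": {"start": datetime(2019, 7, 20, 18, 55, 0),
                   "end": datetime(2019, 7, 20, 18, 55, 4)},
    "id": "Temp",
    "avg": 2.0,
}

def flatten_for_cassandra(row):
    # Mirrors parsed.select("sourcetime.start", ...): pull the struct
    # fields out into top-level keys so every value is a scalar.
    return {
        "start": row["sourceTime"]["start"],
        "end": row["sourceTime"]["end"],
        "id": row["id"],
        "avg": row["avg"],
    }
```

After flattening, each value maps one-to-one onto a Cassandra column of a simple type, so the connector no longer has to convert a struct to java.util.Date.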