How to Write Structured Streaming Data into Cassandra with PySpark?

I want to write Spark Structured Streaming data into Cassandra. My Spark version is 2.4.0.

I've researched some posts, and some of them use the DataStax Enterprise platform. I didn't use it, but I found the foreachBatch method, which helps write streaming data to a sink.

I reviewed the docs on the Databricks site and tried it on my own.

This is the code I've written:

# Deduplicate on (id, sourceTimeStamp), then average "value" over
# 4-second windows per id; rename the window struct to sourceTime.
parsed = parsed_opc \
    .withWatermark("sourceTimeStamp", "10 minutes") \
    .dropDuplicates(["id", "sourceTimeStamp"]) \
    .groupBy(
        window(parsed_opc.sourceTimeStamp, "4 seconds"),
        parsed_opc.id
    ) \
    .agg({"value": "avg"}) \
    .withColumnRenamed("avg(value)", "avg") \
    .withColumnRenamed("window", "sourceTime")

def writeToCassandra(writeDF, epochId):
    # Called by foreachBatch for every micro-batch; writes the batch
    # DataFrame to poc.opc via the Spark Cassandra Connector.
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .options(table="opc", keyspace="poc") \
        .save()

parsed.writeStream \
    .foreachBatch(writeToCassandra) \
    .outputMode("update") \
    .start()
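
Note that foreachBatch only defines what happens per micro-batch; the snippets above assume the Spark Cassandra Connector is on the classpath and a contact point is configured on the session. A minimal sketch of that setup (the connector coordinates, app name, and host below are assumptions; adjust them for your cluster):

from pyspark.sql import SparkSession

# Sketch only: the connector version and contact point are assumptions.
# Submitted with something like:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 app.py
spark = SparkSession.builder \
    .appName("opc-to-cassandra") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()

In a standalone script you would also keep a handle on the query returned by start() and call awaitTermination() on it, so the application does not exit while the stream runs.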

The schema of the parsed dataframe is:

root
 |-- sourceTime: struct (nullable = false)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- id: string (nullable = true)
 |-- avg: double (nullable = true)
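
(This tree is what parsed.printSchema() would print.)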

I can successfully write this streaming df to the console like this:

query = parsed \
    .writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()

And the output in the console is as follows:

+--------------------+----+---+
|          sourceTime|  id|avg|
+--------------------+----+---+
|[2019-07-20 18:55...|Temp|2.0|
+--------------------+----+---+

So when writing to the console, that's OK. But when I query in cqlsh, there are no records appended to the table.

This is the table creation script in Cassandra:

CREATE TABLE poc.opc ( id text, avg float, sourceTime timestamp PRIMARY KEY );

So, can you tell me what is wrong?

After working on the subject, I've found the solution.

Looking closely at the terminal logs, I figured out that there is an error log: com.datastax.spark.connector.types.TypeConversionException: Cannot convert object [2019-07-20 18:55:00.0,2019-07-20 18:55:04.0] of type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema to java.util.Date.

This is because the window operation in Spark produces a struct column in the schema, which in this case was renamed to sourceTime. The schema of sourceTime looks like this:

sourceTime: struct (nullable = false)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)

But the sourceTime column I created in Cassandra expects a single timestamp value, while the error shows Spark trying to send the struct's start and end timestamps, which do not exist in the Cassandra table.

So, selecting these columns from the parsed dataframe solved the problem: cassandra_df = parsed.select("sourcetime.start", "avg", "sourcetime.end", "id").
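
For reference, if the table keeps the single sourceTime column from the CREATE TABLE above, here is a minimal sketch of an alternative (it assumes the window start is the timestamp you want to store; unquoted Cassandra identifiers are lowercased, hence the alias):

from pyspark.sql.functions import col

# Sketch, assuming a single sourceTime timestamp column in Cassandra:
# flatten the window struct and alias the field to the column name.
cassandra_df = parsed.select(
    col("sourceTime.start").alias("sourcetime"),
    col("id"),
    col("avg"),
)

cassandra_df can then be written through the same foreachBatch function shown above.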
