

How to deal with pySpark structured streaming coming from Kafka to Cassandra

I'm using pyspark to get data from Kafka and insert it into Cassandra. I'm almost there; I just need the final step.

def Spark_Kafka_Receiver():

# STEP 1 OK!

    dc = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "000.00.0.240:9092") \
        .option("subscribe", "MyTopic") \
        .load()
    dc.selectExpr("CAST(key as STRING)", "CAST(value AS STRING) as msg")

# STEP 2 OK!

    dc.writeStream \
        .outputMode("append") \
        .foreachBatch(foreach_batch_function) \
        .start() \
        .awaitTermination()

# STEP 3 NEED HELP

def foreach_batch_function(df, epoch_id):
    Value = df.select(df.value)

    ???????

    # WRITE DATA FRAME ON CASSANDRA
    df.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .options(table=table_name, keyspace=keyspace) \
        .save()

So I have my Value, which is in this format:

DataFrame[value: binary]

I need something that decodes the binary inside my Value and builds a DataFrame whose format matches the database table, so that the last part of my code can write it.

You don't need to use foreachBatch anymore. You just need to upgrade to Spark Cassandra Connector 2.5, which natively supports Spark Structured Streaming, so you can just write:

dc.writeStream \
        .format("org.apache.spark.sql.cassandra") \
        .outputMode('append') \
        .options(table=table_name, keyspace=keyspace) \
        .start() \
        .awaitTermination()
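Note that the connector has to be on the classpath and pointed at your Cassandra cluster. A minimal setup sketch is below; the package version and host are assumptions, not part of the original question or answer:

from pyspark.sql import SparkSession

# Assumed setup (not from the original answer): put the connector on the classpath
# and tell it where Cassandra lives when building the SparkSession.
spark = SparkSession.builder \
    .appName("kafka-to-cassandra") \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:2.5.1") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()

You will also generally need a checkpoint location for the streaming query, e.g. .option("checkpointLocation", "/some/path") before .start().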

Regarding the second part of your question: if you want to convert your value into multiple columns, you need to use the from_json function, passing the schema to it. Here is an example in Scala, but the Python code should be quite similar:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._

val schemaStr = "id:int, value:string"
val schema = StructType.fromDDL(schemaStr)
val data = dc.selectExpr("CAST(value AS STRING)")
  .select(from_json($"value", schema).as("data"))
  .select("data.*")

and then you can write that data via writeStream.
