
Writing data as JSON array with Spark Structured Streaming

I have to write data from Spark Structured Streaming as a JSON array. I have tried the following code:

df.selectExpr("to_json(struct(*)) AS value").toJSON

which returns a Dataset[String], but I am unable to write the data as a JSON array.

Current output:

{"name":"test","id":"id"}
{"name":"test1","id":"id1"}

Expected output:

[{"name":"test","id":"id"},{"name":"test1","id":"id1"}]

Edit (moving comments into question):

After using the proposed collect_list method, I get:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;

Then I tried something like this:

withColumn("timestamp", unix_timestamp(col("event_epoch"), "MM/dd/yyyy hh:mm:ss aa"))
  .withWatermark("event_epoch", "1 minutes")
  .groupBy(col("event_epoch"))
  .agg(max(col("event_epoch")).alias("timestamp"))

But I don't want to add a new column.

You can use the SQL built-in function collect_list for this. This function collects and returns a list of non-unique elements (in contrast to collect_set, which returns only unique elements).
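To illustrate the difference, here is a minimal sketch on a static DataFrame (the sample data and column name are made up for demonstration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, collect_set}

val spark = SparkSession.builder().master("local[*]").appName("collect-demo").getOrCreate()
import spark.implicits._

// sample data containing a duplicate value
val letters = Seq("a", "b", "a").toDF("letter")

letters.agg(collect_list("letter"), collect_set("letter")).show(false)
// +--------------------+-------------------+
// |collect_list(letter)|collect_set(letter)|
// +--------------------+-------------------+
// |[a, b, a]           |[a, b]             |
// +--------------------+-------------------+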

From the source code for collect_list you will see that it is an aggregation function. Based on the requirements given in the Structured Streaming Programming Guide on output modes, only the output modes "complete" and "update" are supported for streaming aggregations without a watermark.

[Screenshot: Output Modes compatibility table from the Structured Streaming Programming Guide]

As I understand from your comments, you do not wish to add a watermark or any new columns. Also, the error you are facing,

Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark; 

reminds you not to use the output mode "append".

In the comments, you have mentioned that you plan to produce the results into a Kafka message: one big JSON array as one Kafka value. The complete code could look like this:

import org.apache.spark.sql.streaming.Trigger

val df = spark.readStream
  .[...] // in my test I am reading from a Kafka source
  .load()
  .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "offset", "partition")
  // do not forget to convert your data into a String before writing to Kafka
  .selectExpr("CAST(collect_list(to_json(struct(*))) AS STRING) AS value")

df.writeStream
  .format("kafka")
  .outputMode("complete") // required for a streaming aggregation without a watermark
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .option("checkpointLocation", "/path/to/sparkCheckpoint")
  .trigger(Trigger.ProcessingTime(10000))
  .start()
  .awaitTermination()

Given the key/value pairs (k1,v1), (k2,v2), and (k3,v3) as input, you will get a value in the Kafka topic that contains all selected data as a JSON array:

[{"key":"k1","value":"v1","offset":7,"partition":0}, {"key":"k2","value":"v2","offset":8,"partition":0}, {"key":"k3","value":"v3","offset":9,"partition":0}]

Tested with Spark 3.0.1 and Kafka 2.5.0.
