将聚合结果发送到 Kafka 主题时 Spark 结构化流错误

Question

所有大数据专家，

我在向 Kafka Topic 发送聚合结果时遇到问题。 它适用于没有聚合的转换。 谁能帮我解决这个问题？ 聚合结果对于触发后续事件和不同逻辑很重要。 这是模拟问题。 下面的所有代码都经过测试并且可以正常工作。

Spark 版本 2.4.4

卡夫卡插件 org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4

数据源

#dummy publisher
CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)  

for i in {0..10000}; do echo "{\"name\":\"${i}\", \"dt\":$(date +%s)}";  sleep 1; done | /usr/lib/kafka/bin/kafka-console-producer.sh     --broker-list ${CLUSTER_NAME}-w-1:9092  --topic test_input

> {"name":"3433", "dt":1580282788} 
> {"name":"3434", "dt":1580282789}
> {"name":"3435", "dt":1580282790} 
> {"name":"3436", "dt":1580282791}

转换（不通过聚合分组）

import time
from pyspark.sql.types import *
from pyspark.sql.functions import *

table='test_input'
wallet_txn_log = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
    .option("subscribe", table) \
    .load() \
    .selectExpr("CAST(value AS STRING) as string").select( from_json("string", schema=   StructType([StructField("dt",LongType(),True),StructField("name",StringType(),True)])    ).alias("x")).select('x.*')\
    .select(['name',col('dt').cast(TimestampType()).alias("txn_datetime")]) \
    .select([to_json(struct('name','txn_datetime')).alias("value")]) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
    .option("topic", "test_output_non_aggregate") \
    .option("checkpointLocation", "gs://gcp-datawarehouse/streaming/checkpoints/streaming_test1-{}".format(table)).start()

输出它按预期工作

/usr/lib/kafka/bin/kafka-console-consumer.sh     --bootstrap-server ${CLUSTER_NAME}-w-1:9092 --topic test_output_non_aggregate

{"name":"2844","txn_datetime":"2020-01-29T15:16:36.000+08:00"}
{"name":"2845","txn_datetime":"2020-01-29T15:16:37.000+08:00"}

按聚合分组

我试过加水印和不加水印，都不起作用

table='test_input'
wallet_txn_log = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
    .option("subscribe", table) \
    .load() \
    .selectExpr("CAST(value AS STRING) as string").select( from_json("string", schema= StructType([StructField("dt",LongType(),True),StructField("name",StringType(),True)])  ).alias("x")).select('x.*')\
    .select(['name',col('dt').cast(TimestampType()).alias("txn_datetime")]) \
    .withWatermark("txn_datetime", "5 seconds") \
    .groupBy('name','txn_datetime').agg( 
     count("name").alias("is_txn_count")) \
    .select([to_json(struct('name','is_txn_count')).alias("value")]) \
    .writeStream \
    .format("kafka") \
    .outputMode("update") \
    .option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
    .option("topic", "test_aggregated_output") \
    .option("checkpointLocation", "gs://gcp-datawarehouse/streaming/checkpoints/streaming_test1-aggregated_{}".format(table)).start()

错误

[Stage 1:>                                                        (0 + 3) / 200]20/01/29 16:20:57 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, cep-m.asia-southeast1-c.c.tngd-poc.internal, executor 1): org.apache.spark.util.TaskCompletionListenerException: null
        at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138)
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

纱线日志-applicationId xxx

链接到纱线日志

查询验证

group By 聚合查询是正确的。 它适用于控制台和内存接收器。 但是，在 Kafka Sink 中，它不断抛出错误。

wallet_txn_log = spark \
...     .readStream \
...     .format("kafka") \
...     .option("kafka.bootstrap.servers", "10.148.15.235:9092,10.148.15.236:9092,10.148.15.233:9092") \
...     .option("subscribe", table) \
...     .load() \
...     .selectExpr("CAST(value AS STRING) as string").select( from_json("string", schema= StructType([StructField("dt",LongType(),True),StructField("name",StringType(),True)])  ).alias("x")).select('x.*')\
...     .select(['name',col('dt').cast(TimestampType()).alias("txn_datetime")]) \
...     .withWatermark("txn_datetime", "5 seconds") \
...     .groupBy('name','txn_datetime').agg( 
...      count("name").alias("is_txn_count")) \
...     .select([to_json(struct('name','is_txn_count')).alias("value")]) 
>>> 
>>> df=wallet_txn_log.writeStream \
...     .outputMode("update") \
...     .option("truncate", False) \
...     .format("console") \
...     .start()
-------------------------------------------                                     
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
+-----+

-------------------------------------------                                     
Batch: 1
-------------------------------------------
+--------------------------------+
|value                           |
+--------------------------------+
|{"name":"4296","is_txn_count":1}|
|{"name":"4300","is_txn_count":1}|
|{"name":"4297","is_txn_count":1}|
|{"name":"4303","is_txn_count":1}|
|{"name":"4299","is_txn_count":1}|
|{"name":"4305","is_txn_count":1}|
|{"name":"4298","is_txn_count":1}|
|{"name":"4304","is_txn_count":1}|
|{"name":"4307","is_txn_count":1}|
|{"name":"4302","is_txn_count":1}|
|{"name":"4301","is_txn_count":1}|
|{"name":"4306","is_txn_count":1}|
|{"name":"4310","is_txn_count":1}|
|{"name":"4309","is_txn_count":1}|
|{"name":"4308","is_txn_count":1}|
+--------------------------------+

Answer 1

group by 聚合代码是正确的，就像 Spark 2.4.4 的 Kafka 插件在这种情况下有一个小错误一样。 将 spark 从 2.4.4 降级到 2.4.3 后。 上面的错误消失了。

将聚合结果发送到 Kafka 主题时 Spark 结构化流错误

问题描述

数据源

转换（不通过聚合分组）

按聚合分组

1 个解决方案

解决方案1
1 已采纳 2020-03-04 10:19:31

将聚合结果发送到 Kafka 主题时 Spark 结构化流错误

问题描述

数据源

转换（不通过聚合分组）

按聚合分组

1 个解决方案

解决方案1 1 已采纳 2020-03-04 10:19:31

解决方案1
1 已采纳 2020-03-04 10:19:31