[英]Spark Structured Streaming Error while sending aggregated result to Kafka Topic
所有大数据专家,
我在向 Kafka Topic 发送聚合结果时遇到问题。 它适用于没有聚合的转换。 谁能帮我解决这个问题? 聚合结果对于触发后续事件和不同逻辑很重要。 这是模拟问题。 下面的所有代码都经过测试并且可以正常工作。
Spark 版本 2.4.4
卡夫卡插件 org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
#dummy publisher
CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)
for i in {0..10000}; do echo "{\"name\":\"${i}\", \"dt\":$(date +%s)}"; sleep 1; done | /usr/lib/kafka/bin/kafka-console-producer.sh --broker-list ${CLUSTER_NAME}-w-1:9092 --topic test_input
> {"name":"3433", "dt":1580282788}
> {"name":"3434", "dt":1580282789}
> {"name":"3435", "dt":1580282790}
> {"name":"3436", "dt":1580282791}
import time
from pyspark.sql.types import *
from pyspark.sql.functions import *
table='test_input'
wallet_txn_log = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
.option("subscribe", table) \
.load() \
.selectExpr("CAST(value AS STRING) as string").select( from_json("string", schema= StructType([StructField("dt",LongType(),True),StructField("name",StringType(),True)]) ).alias("x")).select('x.*')\
.select(['name',col('dt').cast(TimestampType()).alias("txn_datetime")]) \
.select([to_json(struct('name','txn_datetime')).alias("value")]) \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
.option("topic", "test_output_non_aggregate") \
.option("checkpointLocation", "gs://gcp-datawarehouse/streaming/checkpoints/streaming_test1-{}".format(table)).start()
输出它按预期工作
/usr/lib/kafka/bin/kafka-console-consumer.sh --bootstrap-server ${CLUSTER_NAME}-w-1:9092 --topic test_output_non_aggregate
{"name":"2844","txn_datetime":"2020-01-29T15:16:36.000+08:00"}
{"name":"2845","txn_datetime":"2020-01-29T15:16:37.000+08:00"}
我试过加水印和不加水印,都不起作用
table='test_input'
wallet_txn_log = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
.option("subscribe", table) \
.load() \
.selectExpr("CAST(value AS STRING) as string").select( from_json("string", schema= StructType([StructField("dt",LongType(),True),StructField("name",StringType(),True)]) ).alias("x")).select('x.*')\
.select(['name',col('dt').cast(TimestampType()).alias("txn_datetime")]) \
.withWatermark("txn_datetime", "5 seconds") \
.groupBy('name','txn_datetime').agg(
count("name").alias("is_txn_count")) \
.select([to_json(struct('name','is_txn_count')).alias("value")]) \
.writeStream \
.format("kafka") \
.outputMode("update") \
.option("kafka.bootstrap.servers", "xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092") \
.option("topic", "test_aggregated_output") \
.option("checkpointLocation", "gs://gcp-datawarehouse/streaming/checkpoints/streaming_test1-aggregated_{}".format(table)).start()
错误
[Stage 1:> (0 + 3) / 200]20/01/29 16:20:57 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, cep-m.asia-southeast1-c.c.tngd-poc.internal, executor 1): org.apache.spark.util.TaskCompletionListenerException: null
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
纱线日志-applicationId xxx
查询验证
group By 聚合查询是正确的。 它适用于控制台和内存接收器。 但是,在 Kafka Sink 中,它不断抛出错误。
wallet_txn_log = spark \
... .readStream \
... .format("kafka") \
... .option("kafka.bootstrap.servers", "10.148.15.235:9092,10.148.15.236:9092,10.148.15.233:9092") \
... .option("subscribe", table) \
... .load() \
... .selectExpr("CAST(value AS STRING) as string").select( from_json("string", schema= StructType([StructField("dt",LongType(),True),StructField("name",StringType(),True)]) ).alias("x")).select('x.*')\
... .select(['name',col('dt').cast(TimestampType()).alias("txn_datetime")]) \
... .withWatermark("txn_datetime", "5 seconds") \
... .groupBy('name','txn_datetime').agg(
... count("name").alias("is_txn_count")) \
... .select([to_json(struct('name','is_txn_count')).alias("value")])
>>>
>>> df=wallet_txn_log.writeStream \
... .outputMode("update") \
... .option("truncate", False) \
... .format("console") \
... .start()
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------------------+
|value |
+--------------------------------+
|{"name":"4296","is_txn_count":1}|
|{"name":"4300","is_txn_count":1}|
|{"name":"4297","is_txn_count":1}|
|{"name":"4303","is_txn_count":1}|
|{"name":"4299","is_txn_count":1}|
|{"name":"4305","is_txn_count":1}|
|{"name":"4298","is_txn_count":1}|
|{"name":"4304","is_txn_count":1}|
|{"name":"4307","is_txn_count":1}|
|{"name":"4302","is_txn_count":1}|
|{"name":"4301","is_txn_count":1}|
|{"name":"4306","is_txn_count":1}|
|{"name":"4310","is_txn_count":1}|
|{"name":"4309","is_txn_count":1}|
|{"name":"4308","is_txn_count":1}|
+--------------------------------+
group by 聚合代码是正确的,就像 Spark 2.4.4 的 Kafka 插件在这种情况下有一个小错误一样。 将 spark 从 2.4.4 降级到 2.4.3 后。 上面的错误消失了。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.