
Spark Structured Streaming application reading from Kafka returns only null values

I plan to extract the data from Kafka using Spark Structured Streaming, but I got only empty data.

# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_csv, from_json
from pyspark.sql.types import StringType, StructType

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName("pyspark_structured_streaming_kafka") \
        .getOrCreate()

    df_raw = spark.read \
        .format("kafka") \
        .option("kafka.bootstrap.servers","52.81.249.81:9092") \
        .option("subscribe","product") \
        .option("kafka.ssl.endpoint.identification.algorithm","") \
        .option("kafka.isolation.level","read_committed") \
        .load()

    df_raw.printSchema()

    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", StringType()) \
        .add("yield_time", StringType()) 

    df_1=df_raw.selectExpr("CAST(value AS STRING)") \
               .select(from_json("value",product_schema).alias("data")) \
               .select("data.*") \
               .write \
               .format("console") \
               .save()

My test data is the following:

{
  "product_name": "X Laptop",
  "product_factory": "B-3231",
  "yield_num": 899,
  "yield_time": "20210201 22:00:01"
}

But the result is not what I expected:

./spark-submit ~/Documents/3-Playground/kbatch.py
+------------+---------------+---------+----------+
|product_name|product_factory|yield_num|yield_time|
+------------+---------------+---------+----------+
|        null|           null|     null|      null|
|        null|           null|     null|      null|

The test data was published with the command:

./kafka-producer-perf-test.sh --topic product --num-records 90000000 --throughput 5 --producer.config ../config/producer.properties --payload-file ~/Downloads/product.json

If I cut away some of the code, like this:

df_1=df_raw.selectExpr("CAST(value AS STRING)") \
               .writeStream \
               .format("console") \
               .outputMode("append") \
               .option("checkpointLocation","file:///Users/picomy/Kafka-Output/checkpoint") \
               .start() \
               .awaitTermination() 

The result is the following:

Batch: 3130
-------------------------------------------
+--------------------+
|               value|
+--------------------+
|    "yield_time":...|
|    "product_name...|
|    "yield_num": ...|
|    "product_fact...|
|    "yield_num": ...|
|    "yield_num": ...|
|    "product_fact...|
|    "product_fact...|
|    "product_name...|
|    "product_fact...|
|    "product_name...|
|                   }|
|    "yield_time":...|
|    "product_name...|
|                   }|
|    "product_fact...|
|    "yield_num": ...|
|    "product_fact...|
|    "yield_time":...|
|    "product_name...|
+--------------------+

I don't know the root cause of the problem.

There are a few things causing your code not to work correctly:

  • Wrong schema (the field yield_num is an integer/long)
  • Using writeStream instead of just write (if you want streaming)
  • Starting the streaming query and calling awaitTermination on it
  • The data in your JSON file should be stored on one line only (see the sketch after this list)
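
As an illustration of the last point: kafka-producer-perf-test.sh sends the payload file line by line, so a pretty-printed JSON object is split across several Kafka records and from_json then returns null for every fragment. A minimal sketch for rewriting the payload so that each record sits on a single line (the file names here are only examples) could look like this:

import json

# Hypothetical file names: read the pretty-printed test object and rewrite it
# so the whole record ends up on a single line, which is what the producer
# payload file needs.
with open("product_pretty.json") as src:
    record = json.load(src)          # the multi-line test object shown above

with open("product.json", "w") as dst:
    dst.write(json.dumps(record))    # one record per line, no embedded newlines
    dst.write("\n")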

You can replace parts of your code with the following snippet:

from pyspark.sql.types import StringType, StructType, LongType

    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", LongType()) \
        .add("yield_time", StringType())

    df_1 = df_raw.selectExpr("CAST(value AS STRING)") \
               .select(from_json("value", product_schema).alias("data")) \
               .select("data.*") \
               .writeStream \
               .format("console") \
               .start() \
               .awaitTermination()
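
For completeness, a minimal end-to-end sketch of the streaming version is shown below, reusing the broker address, topic and checkpoint path from the question. Note that writeStream only works on a streaming DataFrame, so df_raw has to be created with spark.readStream rather than spark.read:

# -*- coding: utf-8 -*-
# Minimal streaming sketch: read from Kafka with readStream, parse the JSON
# value, and print the parsed columns to the console sink.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StringType, StructType, LongType

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("pyspark_structured_streaming_kafka") \
        .getOrCreate()

    # Streaming source: spark.readStream instead of spark.read,
    # otherwise writeStream below would fail.
    df_raw = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "52.81.249.81:9092") \
        .option("subscribe", "product") \
        .load()

    # yield_num is numeric in the payload, so use LongType here.
    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", LongType()) \
        .add("yield_time", StringType())

    # Kafka delivers the payload in the binary `value` column: cast it to a
    # string, parse it with the schema, and flatten the struct into columns.
    query = df_raw.selectExpr("CAST(value AS STRING)") \
        .select(from_json("value", product_schema).alias("data")) \
        .select("data.*") \
        .writeStream \
        .format("console") \
        .outputMode("append") \
        .option("checkpointLocation", "file:///Users/picomy/Kafka-Output/checkpoint") \
        .start()

    query.awaitTermination()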
