
Spark Structured Streaming: encoding issue when reading JSON from Kafka

I am struggling to read JSON data from a Kafka topic using Spark Structured Streaming.

Context:

I'm building a simple pipeline that reads data from a MongoDB database (frequently populated by another app) through Kafka, and I then want to consume this data in Spark.

For that I'm using Spark Structured Streaming, which seems to work.

Here is my code:

import org.apache.spark.sql.SparkSession

object KafkaToParquetLbcAutomation extends App {

  val spark = SparkSession
    .builder
    .appName("Kafka-Parquet-Writer")
    .master("local")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._

  // Read the raw Kafka records as a streaming DataFrame
  val kafkaRawDf = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "BROKER IP") // placeholder: replace with the broker's host:port
    .option("subscribe", "test")
    .option("startingOffsets", "earliest")
    .load()

  // Kafka delivers the record value as bytes; cast it to a string
  val testJsonDf = kafkaRawDf.selectExpr("CAST(value AS STRING)")

  // Display the data on the console
  val query = testJsonDf
    .writeStream
    .outputMode("append")
    .format("console")
    .queryName("test")
    .start()

  // Block until the streaming query stops
  query.awaitTermination()
}

After reading this JSON data I want to apply some transformations.

This is where the problem starts: I can't parse the JSON data because it arrives in a strange encoding that I'm not able to decode.

Therefore I can't go any further with my pipeline.

What the data should look like:

{
  "field 1" : "value 1 ", 
}

(With many other fields)

What I actually get:

VoituresXhttps://URL.fr/voitures/87478648654.htm�https://img5.url.fr/ad-image/49b7c279087d0cce09123a66557b71d09c01a6d2.jpg�https://img7.url.fr/ad-image/eab7e65419c17542840204fa529b02e64771adbb.jpg�https://img7.urln.fr/ad-image/701b547690e48f11a6e0a1a9e72811cc76fe803e.jpg

The issue might be in the delimiter or something like that.

Can you please help me?

Thank you.

Problem solved.

It was a bad configuration in the Kafka connector.

I simply had to add these fields to the connector configuration:

"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable":"false",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable":"false",

It had nothing to do with Spark.
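
Once the connector writes plain JSON, the value column can be parsed on the Spark side with from_json. The following is only a minimal sketch of that step, not the original poster's code: the object name ParseJsonValueSketch is hypothetical, the schema is an assumption that lists only the "field 1" key shown in the question, and a small batch DataFrame stands in for the result of kafkaRawDf.selectExpr("CAST(value AS STRING)") so the example runs without a broker.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object ParseJsonValueSketch {
  // Assumed schema: only "field 1" is known from the question.
  // A real topic needs every expected field declared here.
  val schema: StructType = new StructType().add("field 1", StringType)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("parse-json-value-sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for kafkaRawDf.selectExpr("CAST(value AS STRING)")
    val testJsonDf = Seq("""{"field 1": "value 1"}""").toDF("value")

    // Parse the JSON string into a struct, then flatten it into columns
    val parsedDf = testJsonDf
      .select(from_json(col("value"), schema).as("data"))
      .select("data.*")

    parsedDf.show(truncate = false)
    spark.stop()
  }
}
```

The same select can be applied to the streaming DataFrame, since from_json behaves the same way on batch and streaming inputs. Records that do not match the schema come back as null rather than raising an error, so null columns are usually a sign that the schema or the payload format is not what was expected.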

