

Read from Kafka then writeStream to a JSON file, but only one message found in the HDFS JSON file

I just set up a single-node hadoop/kafka/spark demo environment. In pyspark, I try to read (.readStream) Kafka messages and write (.writeStream) them to a JSON file in hadoop. The weird thing is that under the hadoop "output/test" directory I can find a created JSON file, but it contains only one message. New messages from Kafka never update that JSON file. But I want all messages from Kafka to be stored in one JSON file.
I have tried the sink type as console (writeStream.format("console")) and kafka (writeStream.format("kafka")), and both worked as normal. Any suggestions or comments? Sample code below.

from pyspark.sql import SparkSession, functions
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

spark = SparkSession.builder.appName("kafka-to-hdfs-json").getOrCreate()

# Schema of the JSON payload carried in each Kafka message value
schema = StructType([StructField("stock_name", StringType(), True),
                     StructField("stock_value", DoubleType(), True),
                     StructField("timestamp", LongType(), True)])

# Subscribe to the topic and expose the raw message value as a string column
line = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "127.0.1.1:9092") \
  .option("subscribe", "fakestock") \
  .option("startingOffsets", "earliest") \
  .load() \
  .selectExpr("CAST(value AS STRING)")

# Parse the JSON string into the typed columns defined by the schema
df = line.select(functions.from_json(functions.col("value"), schema).alias("parse_value")) \
  .select("parse_value.stock_name", "parse_value.stock_value", "parse_value.timestamp")

# Continuously write the parsed records to HDFS as JSON files
query = df.writeStream \
  .format("json") \
  .option("checkpointLocation", "output/checkpoint") \
  .option("path", "output/test") \
  .start()

query.awaitTermination()  # block so the streaming query keeps running

It's not possible to store all records in one file. Spark periodically polls batches of data as a Kafka consumer, then writes each batch out as its own unique file in the output path.
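For example, once the query has been running for a while, the individual part files can still be read back together as one DataFrame in a plain batch job. A minimal sketch, reusing the spark session and schema from the question:

# Every micro-batch lands as its own part-*.json file under output/test;
# a batch read over the whole directory combines them into a single DataFrame.
all_records = spark.read.schema(schema).json("output/test")
all_records.show()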

Without knowing how many records are in the topic to begin with, it's hard to say how many records should be in the output path, but your code looks okay. Parquet is a more recommended output format than JSON, however.
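A minimal sketch of the same sink switched to Parquet (the path and checkpoint directory names here are only illustrative):

# Same streaming write, but in Parquet; Spark still emits one file per micro-batch.
query = df.writeStream \
  .format("parquet") \
  .option("checkpointLocation", "output/checkpoint_parquet") \
  .option("path", "output/test_parquet") \
  .start()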

Also worth mentioning that Kafka Connect has an HDFS plugin that only requires writing a config file, with no Spark parsing code.
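As a rough sketch only (assuming the Confluent HDFS Sink connector; property names and values vary by connector and version), such a config file might look like:

name=fakestock-hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=fakestock
hdfs.url=hdfs://localhost:9000
flush.size=1000
format.class=io.confluent.connect.hdfs.json.JsonFormat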
