
Read from kafka then writeStream to json file, but only found one message in HDFS json file

I just set up a single-node Hadoop/Kafka/Spark demo environment. In PySpark, I read Kafka messages with .readStream and write them to a JSON file in Hadoop with .writeStream. The weird thing is that a JSON file is created under the Hadoop "output/test" directory, but it contains only one message. New messages from Kafka never update that file. I want all messages from Kafka to be stored in one JSON file.
I have also tried the console sink (writeStream.format("console")) and the Kafka sink (writeStream.format("kafka")), and both worked as expected. Any suggestions or comments? Sample code follows.

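from pyspark.sql import functions
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

# Expected layout of each JSON message in the topic
schema = StructType([StructField("stock_name", StringType(), True),
                     StructField("stock_value", DoubleType(), True),
                     StructField("timestamp", LongType(), True)])

# `spark` is the SparkSession provided by the pyspark shell.
# Read the topic from the earliest offset; Kafka values arrive as bytes, so cast to string.
line = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "127.0.1.1:9092") \
  .option("subscribe", "fakestock") \
  .option("startingOffsets", "earliest") \
  .load() \
  .selectExpr("CAST(value AS STRING)")

# Parse the JSON string with the schema, then flatten the struct into columns
df = line.select(functions.from_json(functions.col("value"), schema).alias("parse_value")) \
  .select("parse_value.stock_name", "parse_value.stock_value", "parse_value.timestamp")

# Write each micro-batch as JSON files under output/test
query = df.writeStream \
  .format("json") \
  .option("checkpointLocation", "output/checkpoint") \
  .option("path", "output/test") \
  .start()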

It's not possible to store all records in one file. Spark periodically polls batches of data as a Kafka consumer, then writes those batches as unique files.
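If a single file is still needed afterwards, the per-batch files can be merged once the data has been written. Below is a minimal sketch in plain Python, assuming the "output/test" directory has first been copied to the local filesystem (for example with hdfs dfs -get) and that the data files follow Spark's usual part-*.json naming. The function name merge_json_parts and its arguments are made up for illustration.

```python
import glob
import os


def merge_json_parts(src_dir, dest_file):
    """Concatenate part-*.json files into one JSON Lines file.

    Returns the number of records (non-empty lines) written.
    dest_file should not itself match part-*.json inside src_dir.
    """
    count = 0
    # Only part-* data files are picked up; directories such as
    # _spark_metadata do not match the pattern and are skipped.
    part_files = sorted(glob.glob(os.path.join(src_dir, "part-*.json")))
    with open(dest_file, "w", encoding="utf-8") as out:
        for part in part_files:
            with open(part, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:  # ignore blank lines
                        out.write(line + "\n")
                        count += 1
    return count
```

Calling merge_json_parts("test", "all_messages.json") on the local copy would produce one file and return the record count, which also helps check how many messages actually reached the sink. On HDFS itself, hdfs dfs -getmerge does a similar concatenation without any Python.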

Without knowing how many records are in the topic to begin with, it's hard to say how many records should be in the output path, but your code looks okay. That said, Parquet is generally a more recommended output format than JSON.
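Switching the sink to Parquet only changes the format name. The sketch below wraps the same writeStream chain from the question in a helper. The helper name start_parquet_sink, the two default paths, and the one-minute trigger are assumptions for illustration, not part of the original code. A separate checkpoint directory is used because a checkpoint belongs to one query.

```python
def start_parquet_sink(df, path="output/test_parquet",
                       checkpoint="output/checkpoint_parquet"):
    """Start a streaming query that writes micro-batches as Parquet files.

    df is the streaming DataFrame built from the Kafka source.
    """
    return (
        df.writeStream
        .format("parquet")
        .option("checkpointLocation", checkpoint)
        .option("path", path)
        # Assumed interval: a longer trigger means fewer, larger files
        .trigger(processingTime="1 minute")
        .start()
    )
```

With the question's df, this would be called as query = start_parquet_sink(df). Afterwards, spark.read.parquet("output/test_parquet") loads every file in the directory as a single DataFrame, so the number of files on disk matters less for downstream reads.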

Also worth mentioning that Kafka Connect has an HDFS plugin that only requires writing a config file, no Spark parsing code.


 