
Spark Structured Streaming writing to parquet creates so many files

I used Structured Streaming to load messages from Kafka, do some aggregation, and then write the results to a parquet file. The problem is that so many parquet files are created (800 files) for only 100 messages from Kafka.

The aggregation part is:

return model
            // Cast the timeStamp column to a proper timestamp type
            .withColumn("timeStamp", col("timeStamp").cast("timestamp"))
            // Tolerate events arriving up to 30 seconds late
            .withWatermark("timeStamp", "30 seconds")
            // Group into tumbling 5-minute event-time windows
            .groupBy(window(col("timeStamp"), "5 minutes"))
            // Count the events in each window
            .agg(count("*").alias("total"));

The query:

StreamingQuery query = result //.orderBy("window")
            .writeStream()
            // Append mode: a window's result is emitted only after the watermark passes the end of that window
            .outputMode(OutputMode.Append())
            .format("parquet")
            .option("checkpointLocation", "c:\\bigdata\\checkpoints")
            .start("c:\\bigdata\\parquet");

When loading one of the parquet files with Spark, it shows up empty:

+------+-----+
|window|total|
+------+-----+
+------+-----+
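For reference, a minimal PySpark sketch of how the output might be inspected (the paths and part-file name below are placeholders). Depending on the Spark version, the file sink can write one file per partition and micro-batch, and many of those part files may contain no rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-parquet-output").getOrCreate()

# A single part file (hypothetical name) may well be empty.
spark.read.parquet("c:/bigdata/parquet/part-00000-xxxx.snappy.parquet").show()

# Reading the whole output directory combines all part files.
spark.read.parquet("c:/bigdata/parquet").show()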

How can I save the dataset to only one parquet file? Thanks.

My idea was to use Spark Structured Streaming to consume events from Azure Event Hubs and then store them on storage in parquet format.
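As a minimal sketch, assuming the azure-eventhubs-spark connector is available, a streaming DataFrame like the dfInput used below might be created along these lines (the connection string is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eventhubs-to-parquet").getOrCreate()

# Placeholder connection string for the Event Hubs namespace / event hub.
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;EntityPath=<event_hub>;..."

eh_conf = {
    # Newer connector versions expect the connection string to be encrypted,
    # e.g. via org.apache.spark.eventhubs.EventHubsUtils.encrypt.
    "eventhubs.connectionString": connection_string,
}

# "eventhubs" is the streaming source provided by the azure-eventhubs-spark connector.
dfInput = (
    spark.readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load()
)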

I finally figured out how to deal with the many small files being created. Spark version: 2.4.0.

This is how my query looks:

from pyspark.sql.functions import col

(
    dfInput
    # Shuffle into a single partition so each micro-batch writes a single file
    .repartition(1, col('column_name'))
    .select("*")
    .writeStream
    .format("parquet")
    .option("path", "adl://storage_name.azuredatalakestore.net/streaming")
    .option("checkpointLocation", "adl://storage_name.azuredatalakestore.net/streaming_checkpoint")
    # One micro-batch, and therefore one new file, every 480 seconds
    .trigger(processingTime='480 seconds')
    .start()
)

As a result, I have one file created in the storage location every 480 seconds. To find the right balance between file size and number of files (and avoid OOM errors), just play with two parameters: the number of partitions and processingTime, i.e. the batch interval.
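As an illustration of those two knobs, a variation of the same query (column name and paths are placeholders) with more partitions and a shorter trigger would produce more, smaller files per batch:

from pyspark.sql.functions import col

# Example only: 4 partitions -> up to 4 files per trigger,
# and a new micro-batch every 120 seconds instead of 480.
(
    dfInput
    .repartition(4, col('column_name'))
    .writeStream
    .format("parquet")
    .option("path", "adl://storage_name.azuredatalakestore.net/streaming")
    # Each query needs its own checkpoint location (hypothetical path).
    .option("checkpointLocation", "adl://storage_name.azuredatalakestore.net/streaming_checkpoint_2")
    .trigger(processingTime='120 seconds')
    .start()
)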

I hope you can adjust the solution to your use case.
