
How to merge HDFS small files into one large file?

I have a number of small files generated from a Kafka stream, so I would like to merge the small files into one single file. This merge is based on the date, i.e. the original folder may contain a number of previous files, but I only want to merge the files for a given date into one single file.

Any suggestions?

Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.text(name).coalesce(1).write.mode(SaveMode.Append).text(target))

This example assumes the text file format, but you can just as well read any Spark-supported format, and you can use different formats for the source and target as well.
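For instance, a variation of the loop above that reads small JSON files and appends the merged result as Parquet (reusing fs, source and target from the previous snippet; this particular format combination is only an illustration):

fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.json(name).coalesce(1).write.mode(SaveMode.Append).parquet(target))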

You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value").
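A minimal sketch of both variants, assuming a DataFrame df with a string column value and a date column your_date_value (all names and paths are placeholders):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Collapse everything into a single output file (the text writer needs exactly one string column)
df.select("value").repartition(1).write.mode(SaveMode.Overwrite).text("/path/to/merged")

// Or produce one sub-directory (and file) per distinct date;
// partitionBy drops your_date_value from the rows, leaving just the value column
df.repartition(col("your_date_value"))
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("your_date_value")
  .text("/path/to/merged-by-date")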

If you're working with HDFS and S3, this may also be helpful. You might actually even use s3-dist-cp and stay within HDFS.

https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5

There's a specific option to aggregate multiple files in HDFS using a --groupBy option based on a regular expression pattern. So if the date is in the file name, you can group based on that pattern.
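Since s3-dist-cp is a command-line tool, the invocation below is a shell sketch rather than Spark code; the paths and the date in the --groupBy regular expression are placeholders:

s3-dist-cp --src hdfs:///data/kafka-output/ \
           --dest hdfs:///data/merged/ \
           --groupBy '.*(2019-10-01).*'

Per the linked AWS post, files whose names match the pattern are concatenated, and the value of the capture group determines the name of the combined output file.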

You can develop a Spark application. Using this application, read the data from the small files, create a DataFrame, and write the DataFrame to the big file in append mode.
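A minimal sketch of such an application, assuming the small files for one date sit under a per-date folder (all paths, the date value, and the object name are placeholders):

import org.apache.spark.sql.{SaveMode, SparkSession}

object MergeSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MergeSmallFiles").getOrCreate()

    val date = "2019-10-01"

    // Read all small text files for the given date into one DataFrame
    val df = spark.read.text(s"hdfs:///data/kafka-output/$date/*")

    // Collapse to a single partition and append to the merged output directory
    df.coalesce(1)
      .write
      .mode(SaveMode.Append)
      .text(s"hdfs:///data/merged/$date")

    spark.stop()
  }
}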
