
How to merge HDFS small files into one large file?

I have a number of small files generated from a Kafka stream, so I would like to merge the small files into one single file. This merge is based on the date, i.e. the original folder may contain a number of previous files, but I only want to merge the files for a given date into one single file.

Any suggestions?

Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.text(name).coalesce(1).write.mode(SaveMode.Append).text(target))

This example assumes the text file format, but you can just as well read any Spark-supported format, and you can use different formats for the source and target as well.
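For instance, a variation of the loop above that reads small JSON files and appends the merged result as Parquet (reusing fs, source and target from the previous snippet; this particular format combination is only an illustration):

fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.json(name).coalesce(1).write.mode(SaveMode.Append).parquet(target))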

You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value").
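A minimal sketch of both variants, assuming a DataFrame df with a string column value and a date column your_date_value (all names and paths are placeholders):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Collapse everything into a single output file (the text writer needs exactly one string column)
df.select("value").repartition(1).write.mode(SaveMode.Overwrite).text("/path/to/merged")

// Or produce one sub-directory (and file) per distinct date;
// partitionBy drops your_date_value from the rows, leaving just the value column
df.repartition(col("your_date_value"))
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("your_date_value")
  .text("/path/to/merged-by-date")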

If you're working with HDFS and S3, this may also be helpful. You might actually even use s3-dist-cp and stay within HDFS.

https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5

There's a specific option to aggregate multiple files in HDFS using a --groupBy option based on a regular expression pattern. So if the date is in the file name, you can group based on that pattern.
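Since s3-dist-cp is a command-line tool, the invocation below is a shell sketch rather than Spark code; the paths and the date in the --groupBy regular expression are placeholders:

s3-dist-cp --src hdfs:///data/kafka-output/ \
           --dest hdfs:///data/merged/ \
           --groupBy '.*(2019-10-01).*'

Per the linked AWS post, files whose names match the pattern are concatenated, and the value of the capture group determines the name of the combined output file.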

You can develop a Spark application. Using this application, read the data from the small files, create a DataFrame, and write the DataFrame to the big file in append mode.
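A minimal sketch of such an application, assuming the small files for one date sit under a per-date folder (all paths, the date value, and the object name are placeholders):

import org.apache.spark.sql.{SaveMode, SparkSession}

object MergeSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MergeSmallFiles").getOrCreate()

    val date = "2019-10-01"

    // Read all small text files for the given date into one DataFrame
    val df = spark.read.text(s"hdfs:///data/kafka-output/$date/*")

    // Collapse to a single partition and append to the merged output directory
    df.coalesce(1)
      .write
      .mode(SaveMode.Append)
      .text(s"hdfs:///data/merged/$date")

    spark.stop()
  }
}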
