
Append/concatenate two files using spark/scala

I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation is done frequently (every hour), I need to append those multiple files to the source file.

I found that FileUtil provides a 'copyMerge' function, but it does not allow appending to an existing file.
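For reference, this is roughly how copyMerge is called (Hadoop 2.x API; it was removed in Hadoop 3.0, and the paths below are placeholders). It always writes a new destination file, which is why it cannot append:

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

 val conf = new Configuration()
 val fs   = FileSystem.get(conf)

 // Concatenate all files under the source directory into a single new file.
 // copyMerge fails if the destination already exists, so it cannot append
 // to an existing source file.
 FileUtil.copyMerge(
   fs, new Path("path/source-parts"),  // source directory (placeholder)
   fs, new Path("path/merged.txt"),    // destination file (placeholder)
   false,                              // do not delete the source files
   conf,
   null)                               // no separator string between files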

Thank you for your help.

You can do this with two methods:

 sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")

Or, as @Pushkr has proposed:

 new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"),..)).coalesce(1).saveAsTextFile("path/newSource")
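A minimal self-contained sketch of the union approach (using sc.union, which is the usual entry point for UnionRDD; application name and paths are placeholders):

 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setAppName("merge-files"))

 // Read each input separately and union them into one RDD.
 val merged = sc.union(Seq(
   sc.textFile("path/source"),
   sc.textFile("path/file1"),
   sc.textFile("path/file2")))

 // coalesce(1) forces a single output part file under path/newSource.
 merged.coalesce(1).saveAsTextFile("path/newSource")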

If you don't want to create a new source directory every hour and would rather overwrite the same output each run, you can use a DataFrame with save mode Overwrite (see How to overwrite the output directory in Spark).
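A sketch of the DataFrame variant (paths and application name are placeholders; SaveMode.Overwrite lets the hourly job reuse the same output directory instead of producing a new one each run):

 import org.apache.spark.sql.{SaveMode, SparkSession}

 val spark = SparkSession.builder().appName("merge-files").getOrCreate()

 // Read the current source together with the new hourly files
 // and collapse to a single partition.
 val merged = spark.read.text("path/source", "path/file1", "path/file2").coalesce(1)

 // Overwrite replaces the target directory on every run.
 // Write to a separate path from the one you are reading, since
 // overwriting a directory you are still reading from is unsafe.
 merged.write.mode(SaveMode.Overwrite).text("path/merged")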
