
Output Sequence while writing to HDFS using Apache Spark

I am working on a project in Apache Spark, and the requirement is to write the processed output from Spark in a specific format: Header -> Data -> Trailer. For writing to HDFS I am using the .saveAsHadoopFile method and writing the data to multiple files, using the key as the file name. But the issue is that the sequence of the data is not maintained; files are written as Data -> Header -> Trailer or some other combination of the three. Is there anything I am missing with the RDD transformations?

Ok, so after reading StackOverflow questions, blogs and mail archives from Google, I found out how exactly .union() and other transformations work and how partitioning is managed. When we use .union(), the resulting RDD loses the partition information and also the ordering, and that is why my output sequence was not being maintained.

What I did to overcome the issue is to number the records like

Header = 1, Body = 2, and Footer = 3

So, using sortBy on the RDD which is the union of all three, I sorted it by this order number into 1 partition. After that, to write to multiple files using the key as the file name, I used a HashPartitioner so that data with the same key goes into a separate file.
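For illustration only, tuples of that shape can be built roughly like this; the key string, the sample lines and the variable names below are placeholders made up for the example, not the actual job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("tag-demo").setMaster("local[*]"))

val fileKey = "output-file-1"                                       // placeholder key, later used as the file name
val rawBody: RDD[String] = sc.parallelize(Seq("row1|a", "row2|b"))  // placeholder data lines

// tag every record with (key, (line, order)): 1 = header, 2 = body, 3 = footer
val header: RDD[(String,(String,Int))] = sc.parallelize(Seq((fileKey, ("HEADER", 1))))
val data:   RDD[(String,(String,Int))] = rawBody.map(line => (fileKey, (line, 2)))
val footer: RDD[(String,(String,Int))] = sc.parallelize(Seq((fileKey, ("TRAILER", 3))))

My actual snippet then looks like this: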

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.{PairRDDFunctions, RDD}

val header: RDD[(String,(String,Int))] = ... // this is my header RDD
val data:   RDD[(String,(String,Int))] = ... // this is my data RDD
val footer: RDD[(String,(String,Int))] = ... // this is my footer RDD

// union all three, sort by the order number (1 = header, 2 = body, 3 = footer) into one partition, then drop it
val finalRDD: RDD[(String,String)] = header.union(data).union(footer)
  .sortBy(x => x._2._2, ascending = true, numPartitions = 1)
  .map(x => (x._1, x._2._1))

// partition by key so that all records for a key land in the same partition / output file
val output: RDD[(String,String)] = new PairRDDFunctions[String,String](finalRDD)
  .partitionBy(new HashPartitioner(num))

output.saveAsHadoopFile ... // and using MultipleTextOutputFormat, save to multiple files using the key as the file name
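The output format itself is elided above; a rough sketch of how it can be done (the class name and the output path are placeholders, and this follows the pattern from the first reference below rather than my exact code):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// route each record to a file named after its key, and write only the value into that file
class KeyAsFileNameOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}

output.saveAsHadoopFile("/tmp/spark-output",                        // placeholder output path
  classOf[String], classOf[String], classOf[KeyAsFileNameOutputFormat])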

This might not be the final or most economical solution, but it worked. I am also trying to find other ways to maintain the output sequence as Header -> Body -> Footer. I also tried .coalesce(1) on all three RDDs before doing the union; that just adds three more transformations to the RDDs, and the .sortBy function also takes partition information (which I thought would be the same), but coalescing the RDDs first also worked. If anyone has another approach, or anything to add, please let me know; that would be really helpful, as I am new to Spark.
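For reference, the coalesce-first variant mentioned above looked roughly like this (again only a sketch, not the exact job):

// collapse each piece into a single partition first, then union them;
// the resulting UnionRDD keeps the partitions in header -> data -> footer order
val ordered: RDD[(String,String)] =
  header.coalesce(1)
    .union(data.coalesce(1))
    .union(footer.coalesce(1))
    .map(x => (x._1, x._2._1))

// followed by the same partitionBy / saveAsHadoopFile steps as above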

References:

Write to multiple outputs by key Spark - one Spark job

Ordered union on spark RDDs

http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-td766.html -- this one helped a lot
