
Avoiding file collisions in Hadoop Pig script that writes multiple output files

I'm writing a Pig script that looks as follows:

...
myGroup = group simplifiedJoinData by (dir1, dir2, dir3, dir4);
betterGroup = foreach myGroup {
    value1Value2 = foreach simplifiedJoinedGroup generate value1, value2;
    distinctValue1Value2 = DISTINCT value1Value2;
    generate group, distinctValue1Value2;
}
store betterGroup into '/myHdfsPath/myMultiStorageTest' using MyMultiStorage('output', '0', 'none');

Please note that the schema of simplifiedJoinData is simplifiedJoinedGroup: {dir1: long, dir2: long, dir3: chararray, dir4: chararray, value1: chararray, value2: chararray}

It uses a custom storage class (MyMultiStorage - basically a modified version of MultiStorage in the piggybank) that writes multiple output files. The custom storage class expects that the values passed to it are in the following format:

{group:(dir1:long,dir2:long,dir3:chararray,dir4:chararray), bag:{(value1:chararray,value2:chararray)}}
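
For illustration only, here is a minimal, self-contained sketch of how a store function could unpack a record in that shape and derive the per-directory file names described below. This is not the actual MyMultiStorage implementation; the MultiStorageRecordSketch class, the handleRecord helper, and the value1_values.txt / value2_values.txt targets are assumptions for the sake of the example:

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MultiStorageRecordSketch {

    // Unpack one record of the form {group:(dir1,dir2,dir3,dir4), bag:{(value1,value2)}}
    static void handleRecord(Tuple record) throws Exception {
        Tuple group = (Tuple) record.get(0);      // (dir1, dir2, dir3, dir4)
        DataBag values = (DataBag) record.get(1); // {(value1, value2)}

        String dir = group.get(0) + "/" + group.get(1) + "/"
                   + group.get(2) + "/" + group.get(3);

        for (Tuple v : values) {
            // A real store function would hand these to two open record writers;
            // here we only show which file each value would belong to.
            System.out.println(dir + "/value1_values.txt <- " + v.get(0));
            System.out.println(dir + "/value2_values.txt <- " + v.get(1));
        }
    }

    public static void main(String[] args) throws Exception {
        TupleFactory tf = TupleFactory.getInstance();
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(tf.newTuple(java.util.Arrays.asList((Object) "a1", "b1")));
        Tuple group = tf.newTuple(java.util.Arrays.asList((Object) 1L, 2L, "x", "y"));
        handleRecord(tf.newTuple(java.util.Arrays.asList((Object) group, bag)));
    }
}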

What I'd like the custom storage class to do is output multiple files as follows:

dir1/dir2/dir3/dir4/value1_values.txt
dir1/dir2/dir3/dir4/value2_values.txt

where value1_values.txt contains all the value1 values and value2_values.txt contains all the value2 values. Ideally I would prefer not to write multiple part files that have to be combined later (note that the example has been simplified for the purposes of this discussion; the real output files are binary structures that can't be combined with a simple cat). I have this working for small data sets; however, when I run with larger data sets, I run into issues where I get exceptions in Hadoop that the output file name already exists or is already being created:

java.io.IOException: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException

I suspect that this is because multiple mappers or reducers are attempting to write the same file, and I am not using part IDs in the filename as PigStorage does. However, I would have expected that by grouping the data, I'd only have one record for each dir1, dir2, dir3, dir4 combination, and, as such, only one mapper or reducer would be attempting to write a particular file for a given run. I've tried running with speculative execution disabled for both map and reduce tasks, but that seems to have had no effect. Clearly I don't understand what's going on here.
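
For reference, disabling speculative execution comes down to a couple of job configuration flags. A minimal sketch is below; the property names differ between Hadoop versions, so both spellings are shown, and the DisableSpeculation class name is just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableSpeculation {
    public static void disable(Job job) {
        Configuration conf = job.getConfiguration();
        // Newer (Hadoop 2.x+) property names
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        // Older (mapred-era) property names
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    }
}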

My question is: why am I getting the AlreadyBeingCreatedException?

If there is no way for me to have a single reducer write all the data for each record, it would be acceptable to write multiple part output files in a directory (one per reducer) and combine them after the fact. It just wouldn't be ideal. However, as of yet, I have not been able to determine the proper way to have the custom storage class determine a unique filename, and I still end up with multiple reducers trying to create/write the same file. Is there a particular method in the job configuration or context that would allow me to coordinate parts across the job?
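
One common way to guarantee unique names, sketched below, is to pull the task index out of the TaskAttemptContext inside the custom OutputFormat and fold it into the file name, much like the standard part-r-00000 naming. This is only a sketch of that idea, not the author's code; the PartNaming class and the prefix argument are made up for illustration:

import java.text.NumberFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class PartNaming {
    // Builds e.g. dir1/dir2/dir3/dir4/value1_values-00003.txt for reducer 3,
    // so two reducers can never try to create the same HDFS path.
    public static Path uniqueFile(Path dir, String prefix, TaskAttemptContext context) {
        int taskId = context.getTaskAttemptID().getTaskID().getId();
        NumberFormat nf = NumberFormat.getInstance();
        nf.setMinimumIntegerDigits(5);
        nf.setGroupingUsed(false);
        return new Path(dir, prefix + "-" + nf.format(taskId) + ".txt");
    }
}

With the task index baked into the name, each reducer writes its own part, and the per-directory parts can then be merged afterwards, which matches the "write parts and combine later" fallback described above.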

Thanks in advance for any help you can provide.

Turns out that there was a condition where I was generating the same file name due to a tuple parsing error. I was getting the AlreadyBeingCreatedException for that exact reason.

Nothing wrong with the custom store function, or with approaching the problem in this manner. Just a silly mistake on my part!
