
Avoiding file collisions in Hadoop Pig script that writes multiple output files

I'm writing a Pig script that looks as follows:

...
myGroup = group simplifiedJoinData by (dir1, dir2, dir3, dir4);
betterGroup = foreach myGroup { 
   value1Value2 = foreach simplifiedJoinData generate value1, value2;
   distinctValue1Value2 = DISTINCT value1Value2;
   generate group, distinctValue1Value2;
}
store betterGroup into '/myHdfsPath/myMultiStorageTest' using MyMultiStorage('output', '0', 'none' );

Please note that the schema of simplifiedJoinData is: {dir1: long, dir2: long, dir3: chararray, dir4: chararray, value1: chararray, value2: chararray}

It uses a custom storage class (MyMultiStorage - basically a modified version of MultiStorage in the piggybank) that writes multiple output files. The custom storage class expects that the values passed to it are in the following format:

{group:(dir1:long, dir2:long, dir3:chararray, dir4:chararray), bag:{(value1:chararray, value2:chararray)}}
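
To make the expected shape concrete, here is a minimal sketch of how a store function might unpack a record like that (this is not the actual MyMultiStorage code; the WriterPool helper is just a stand-in for whatever stream handling the real class does):

import java.io.IOException;
import java.io.Writer;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class GroupedRecordSketch {

    /** Hypothetical helper that opens (or reuses) a stream for a given output file. */
    public interface WriterPool {
        Writer openWriter(String relativePath) throws IOException;
    }

    private final WriterPool writers;

    public GroupedRecordSketch(WriterPool writers) {
        this.writers = writers;
    }

    /** Unpacks one record of the form {group:(dir1..dir4), {(value1, value2)}}. */
    public void writeRecord(Tuple t) throws IOException {
        try {
            Tuple group = (Tuple) t.get(0);          // (dir1, dir2, dir3, dir4)
            String dir = group.get(0) + "/" + group.get(1) + "/"
                       + group.get(2) + "/" + group.get(3);

            DataBag values = (DataBag) t.get(1);     // {(value1, value2)}
            Writer v1 = writers.openWriter(dir + "/value1_values.txt");
            Writer v2 = writers.openWriter(dir + "/value2_values.txt");
            for (Tuple v : values) {
                v1.write(v.get(0) + "\n");
                v2.write(v.get(1) + "\n");
            }
        } catch (Exception e) {
            throw new IOException("Could not unpack record " + t, e);
        }
    }
}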

What I'd like the custom storage class to do is output multiple files as follows: dir1/dir2/dir3/dir4/value1_values.txt and dir1/dir2/dir3/dir4/value2_values.txt

where value1_values.txt contains all the value1 values and value2_values.txt contains all the value2 values. Ideally I would prefer not to write multiple part files that have to be combined later (note that the example has been simplified for the purposes of this discussion; the real output files are binary structures that can't be combined with a simple cat). I have this working for small data sets; however, when I run with larger data sets, I get exceptions in Hadoop saying that the output file already exists or is already being created:

java.io.IOException: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException

I suspect that this is because multiple mappers or reducers are attempting to write the same file, and I am not using part IDs in the filename as PigStorage does. However, I would have expected that by grouping the data, I'd only have one record for each dir1, dir2, dir3, dir4 combination, and, as such, only one mapper or reducer would be attempting to write a particular file for a given run. I've tried running without speculative execution for both map and reduce tasks, but that seems to have had no effect. Clearly I don't understand what's going on here.

My question is: Why am I getting the AlreadyBeingCreatedException?

If there is no way for me to have a single reducer write all the data for each record, it would be acceptable to write multiple part output files in a directory (one per reducer) and combine them after the fact. It just wouldn't be ideal. However, as of yet, I have not been able to determine the proper way to have the custom storage class generate a unique filename, and I still end up with multiple reducers trying to create/write the same file. Is there a particular method in the job configuration or context that would allow me to coordinate parts across the job?
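
One direction I'm considering (and which, as far as I can tell, is roughly what the stock MultiStorage in the piggybank does) is to pull the task id out of the TaskAttemptContext handed to the RecordWriter and embed it in the file name, so no two tasks ever open the same HDFS path. A rough sketch of that idea, with illustrative names only:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UniquePartNaming {

    /**
     * Builds an output path such as dir1/dir2/dir3/dir4/value1_values-3.txt,
     * where 3 is the id of the reduce task writing the file. Because the
     * task id is unique within the job, each reducer writes its own part.
     */
    public static Path partPathFor(TaskAttemptContext context, String dir, String baseName) {
        int taskId = context.getTaskAttemptID().getTaskID().getId();
        Path jobOutput = FileOutputFormat.getOutputPath(context);
        return new Path(jobOutput, dir + "/" + baseName + "-" + taskId + ".txt");
    }
}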

Thanks in advance for any help you can provide.

It turns out that there was a condition where I was generating the same file name due to a tuple parsing error, and that was exactly why I was getting the AlreadyBeingCreatedException.

There was nothing wrong with the custom store function or with approaching the problem in this manner. Just a silly mistake on my part!
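
For anyone who hits the same symptom: a simple sanity check on the group key before building the output path would have surfaced the parsing bug as a clear error instead of an AlreadyBeingCreatedException. A sketch of the kind of check I mean (illustrative only, not the actual fix):

import java.io.IOException;
import org.apache.pig.data.Tuple;

public class GroupKeyCheck {

    /** Turns the 4-field group key into dir1/dir2/dir3/dir4, failing loudly if it is malformed. */
    public static String pathFor(Tuple group) throws IOException {
        try {
            if (group == null || group.size() != 4) {
                throw new IOException("Expected a 4-field group key, got: " + group);
            }
            StringBuilder path = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                Object field = group.get(i);
                if (field == null) {
                    throw new IOException("Null field " + i + " in group key: " + group);
                }
                if (i > 0) {
                    path.append('/');
                }
                path.append(field);
            }
            return path.toString();
        } catch (org.apache.pig.backend.executionengine.ExecException e) {
            throw new IOException("Could not read group key: " + group, e);
        }
    }
}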
