
Hadoop streaming job with pipes in the combiner

I'm trying to run a Hadoop Streaming job like so:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.*.jar \
  -files count.pl \
  -input "/my_events/*.bz2" \
  -output count-events \
  -mapper "cut -f2,4 | grep foo | cut -f1" \
  -combiner "perl count.pl -s | perl count.pl" \
  -reducer "perl count.pl"

The count.pl script just counts keys, looping over the input like so (simplified):

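use strict;

my %count;
# Each input line is either "key" or "key<TAB>count".
while (<>) {
  chomp;
  my ($k, $c) = split /\t/, $_, 2;
  $c ||= 1;  # a bare key counts once
  $count{$k} += $c;
}
# Emit one "key<TAB>total" line per key.
while (my ($k, $c) = each %count) {
  print "$k\t$c\n";
}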

It fails, and in the Hadoop syslog output I see bizarre things like this. Note that it somehow contains the Perl script source, some 1's, and some bzipped data:

2014-03-26 19:04:20,595 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child : java.io.IOException: subprocess exited successfully
R/W/S=8193/81/0 in:4096=8193/2 [rec/s] out:40=81/2 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 HOST=null
USER=kwilliams
HADOOP_USER=null
last tool output: |}    1   1rint "$k\t$c\n";   1each %count) { 1ne $lastkey)) {    1��@p@P 0�H�l$�H��L�d$�L�l$�L�t$�H��(
H�GhH�wH��H��H�GhHc�H��H)�H��H���C����L�AH�L�$�J�4&�F��H�L�0E1��~H�EJ�t �F��H�D�hA��H��(H��A��H���
...%�����A��E��tRIc�H��H��L�s������EX ui    0|
Broken pipe

and the stderr output has:

Can't open |: Broken pipe at count.pl line 12.

It turns out this is a specific problem with using pipes in a Streaming combiner.

Unlike the mapper and reducer, whose commands are allowed to contain shell pipes, the combiner command cannot. Hadoop Streaming interprets the combiner as the following (pretend $data is the output of the mapper):

cat $data | perl 'count.pl' '-s' '|' 'perl' 'count.pl'

So the count.pl script, which uses Perl's <> construct, first parses its command-line flags (handling the -s), and then, because <> reads from any files named on the command line, tries to open and read files called |, perl, and count.pl rather than just the $data stream.

That is why all that junk shows up in the syslog output, including pieces of the count.pl script itself.
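To see the difference in isolation, here is a minimal shell sketch that does not involve Hadoop at all. show_args is a hypothetical helper standing in for count.pl; it only prints the arguments it was started with. The sketch assumes, as described above, that the combiner command is split into words and run without a shell in between.

```shell
#!/bin/sh
# show_args is a hypothetical stand-in for count.pl: it prints one
# bracketed line per argument it was started with.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# What the combiner effectively does: the pipe character arrives as
# an ordinary argument, together with the words after it.
literal=$(show_args -s '|' perl count.pl)

# What a shell does with an unquoted pipe: show_args sees only -s,
# and its output is fed to the next command (here, wc -l).
piped=$(show_args -s | wc -l | tr -d ' ')

printf '%s\n' "$literal"
printf 'lines through a real pipe: %s\n' "$piped"
```

In the first case the helper receives four arguments, including a literal |; in the second it receives only -s, so a single line goes through the pipe.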

I just thought this was a crazy enough circumstance that I'd better post it somewhere.
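One possible way around this, offered as a sketch and not something verified on a cluster: move the pipeline into a small wrapper script so that a real shell interprets the pipe. The file name combiner.sh is hypothetical, and the sketch assumes perl and count.pl are available in the task's working directory.

```shell
#!/bin/sh
# Write a hypothetical wrapper script, combiner.sh. Because sh runs
# this file, the | below is a real pipe, not a literal argument.
cat > combiner.sh <<'EOF'
#!/bin/sh
perl count.pl -s | perl count.pl
EOF
chmod +x combiner.sh

# Syntax-check the wrapper without executing it.
sh -n combiner.sh && echo "combiner.sh parses cleanly"
```

The job would then presumably ship both files with -files count.pl,combiner.sh and use -combiner "sh combiner.sh", so the combiner command itself contains no pipe.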
