
Hadoop ERROR streaming

I have a Mapper:

join1_mapper.py

#!/usr/bin/env python
import sys

for line in sys.stdin:

    line      = line.strip()
    key_value = line.split(",")           # each input line is "key,value"
    key_in    = key_value[0].split(" ")   # the key is either "date word" or just "word"
    value_in  = key_value[1]

    if len(key_in) >= 2:                  # key contains a date: emit "word<TAB>date value"
        date      = key_in[0]
        word      = key_in[1]
        value_out = date + " " + value_in
        print('%s\t%s' % (word, value_out))
    else:                                 # key is the word alone: pass the value through
        print('%s\t%s' % (key_in[0], value_in))
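
To see what the mapper emits, it can be run locally on a couple of sample lines (the lines below are made up, just shaped the way the code expects: an optional date in front of the word, and a count after the comma):

printf 'Jan-01 able,5\nable,991\n' | python join1_mapper.py
# prints (tab-separated key/value):
# able    Jan-01 5
# able    991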

and I have this reducer, join1_reducer.py:

#!/usr/bin/env python
import sys

prev_word          = "  "    # sentinel: no key has been seen yet
months             = ['Jan','Feb','Mar','Apr','Jun','Jul','Aug','Sep','Nov','Dec']
dates_to_output    = []      # dates collected for the current word
day_cnts_to_output = []      # per-date counts collected for the current word
line_cnt           = 0       # number of input lines processed

for line in sys.stdin:

    line      = line.strip()
    key_value = line.split('\t')   # mapper output is "word<TAB>value"
    line_cnt  = line_cnt + 1

    curr_word = key_value[0]
    value_in  = key_value[1]

    # The key changed: flush everything collected for the previous word.
    if curr_word != prev_word:

        if line_cnt > 1:
            for i in range(len(dates_to_output)):
                print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word, day_cnts_to_output[i], curr_word_total_cnt))

            dates_to_output    = []
            day_cnts_to_output = []

        prev_word = curr_word

    # A value that starts with a month abbreviation is a "date count" pair;
    # any other value is taken to be the word's total count.
    if value_in[0:3] in months:
        date_day = value_in.split()   # split the "date count" value into [date, count]
        dates_to_output.append(date_day[0])
        day_cnts_to_output.append(date_day[1])
    else:
        curr_word_total_cnt = value_in

# Flush the last word once the input is exhausted.
for i in range(len(dates_to_output)):
    print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word, day_cnts_to_output[i], curr_word_total_cnt))
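
Since Hadoop streaming only pipes text through these two scripts, the whole job can be simulated locally before submitting it, which is a quick way to rule the Python code out when a task reports "subprocess failed with code 1". A rough sketch (the file names are placeholders, and sort stands in for the shuffle):

cat file1.txt file2.txt | python join1_mapper.py | sort -k1,1 | python join1_reducer.py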

When I run this job:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/input -output /user/cloudera/output_join -mapper /home/cloudera/join1_mapper.py -reducer /home/cloudera/join1_reducer.py

I get the error:

 ERROR streaming.StreamJob: Job not successful!
 Streaming Command Failed!

The first part of the log says:

 packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar] /tmp/streamjob7178107162745054499.jar tmpDir=null
 15/11/13 02:03:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
 15/11/13 02:03:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
 15/11/13 02:03:43 INFO mapred.FileInputFormat: Total input paths to process : 4
 15/11/13 02:03:43 INFO mapreduce.JobSubmitter: number of splits:5
 15/11/13 02:03:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1445251653083_0013
 15/11/13 02:03:44 INFO impl.YarnClientImpl: Submitted application application_1445251653083_0013
 15/11/13 02:03:44 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1445251653083_0013/
 15/11/13 02:03:44 INFO mapreduce.Job: Running job: job_1445251653083_0013
 15/11/13 02:03:53 INFO mapreduce.Job: Job job_1445251653083_0013 running in uber mode : false
 15/11/13 02:03:53 INFO mapreduce.Job: map 0% reduce 0%
 15/11/13 02:04:19 INFO mapreduce.Job: map 40% reduce 0%
 15/11/13 02:04:19 INFO mapreduce.Job: Task Id : attempt_1445251653083_0013_m_000002_0, Status : FAILED
 Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
     at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
     at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
     at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
     at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:415)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 15/11/13 02:04:19 INFO mapreduce.Job: Task Id : attempt_1445251653083_0013_m_000003_0, Status : FAILED
 Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at

I've looked for help at the tracking URL http://quickstart.cloudera:8088/proxy/application_1445251653083_0013/, but it is not clear to me what I should do. I don't understand where the error is. Could someone please help me?
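
The actual Python traceback behind a "subprocess failed with code 1" failure normally ends up in the failed attempt's stderr log. Besides the tracking URL, it can be pulled from the command line once the application has finished, assuming YARN log aggregation is enabled on the cluster:

yarn logs -applicationId application_1445251653083_0013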

I have solved it. The HDFS input directory must contain only the TXT files used for the calculation; in my case there were other files in it as well. I created another directory, moved the TXT files into it, and ran the job again with the new directory as the HDFS input. Now it works.
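
In HDFS shell terms the fix looks roughly like this (the directory and file names below are only examples, not the actual ones):

# Create a fresh input directory that contains nothing but the data files for the job
hdfs dfs -mkdir /user/cloudera/input_clean
hdfs dfs -cp /user/cloudera/input/file1.txt /user/cloudera/input/file2.txt /user/cloudera/input_clean/

# Remove the output directory left over from the failed run, if it exists
hdfs dfs -rm -r /user/cloudera/output_join

# Re-run the streaming job against the clean input directory
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/input_clean \
    -output /user/cloudera/output_join \
    -mapper /home/cloudera/join1_mapper.py \
    -reducer /home/cloudera/join1_reducer.py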
