I have this mapper:
join1_mapper.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    key_value = line.split(",")
    key_in = key_value[0].split(" ")
    value_in = key_value[1]
    if len(key_in) >= 2:
        date = key_in[0]
        word = key_in[1]
        value_out = date + " " + value_in
        print('%s\t%s' % (word, value_out))
    else:
        print('%s\t%s' % (key_in[0], value_in))
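Note that the mapper assumes every input line contains a comma: on a line without one, `key_value[1]` raises an IndexError, and an uncaught Python exception in a streaming task surfaces as exactly the "subprocess failed with code 1" error quoted further down. A minimal sketch of the same per-line logic as a testable function, with that case guarded (a hypothetical refactor, not the original script):

```python
def map_line(line):
    """Mirror of the mapper's per-line logic as a function.

    Returns the tab-separated key/value output string, or None for
    lines without a comma -- which the original mapper would instead
    crash on with an IndexError at key_value[1].
    """
    line = line.strip()
    key_value = line.split(",")
    if len(key_value) < 2:  # no comma: the original code raises IndexError here
        return None
    key_in = key_value[0].split(" ")
    value_in = key_value[1]
    if len(key_in) >= 2:
        date, word = key_in[0], key_in[1]
        return "%s\t%s" % (word, date + " " + value_in)
    return "%s\t%s" % (key_in[0], value_in)
```

The `for line in sys.stdin` driver would then print `map_line(line)` whenever it is not None; skipping malformed lines keeps one stray input file from killing the whole map task.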
and I have this reducer:
join1_reducer.py
#!/usr/bin/env python
import sys

prev_word = "  "
# all twelve month abbreviations (the original list was missing 'May' and 'Oct')
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
dates_to_output = []
day_cnts_to_output = []
line_cnt = 0

for line in sys.stdin:
    line = line.strip()
    key_value = line.split('\t')
    line_cnt = line_cnt + 1
    curr_word = key_value[0]
    value_in = key_value[1]
    if curr_word != prev_word:
        if line_cnt > 1:
            for i in range(len(dates_to_output)):
                print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word, day_cnts_to_output[i], curr_word_total_cnt))
            dates_to_output = []
            day_cnts_to_output = []
        prev_word = curr_word
    if value_in[0:3] in months:
        date_day = value_in.split()
        dates_to_output.append(date_day[0])
        day_cnts_to_output.append(date_day[1])
    else:
        curr_word_total_cnt = value_in

for i in range(len(dates_to_output)):
    print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word, day_cnts_to_output[i], curr_word_total_cnt))
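Since Hadoop streaming delivers reducer input sorted by key, the grouping above can be exercised locally on pre-sorted records. A condensed, testable re-implementation of the same logic (a hypothetical helper for local testing, not part of the job):

```python
def reduce_sorted(lines):
    """Join per-day counts with total counts for tab-separated
    (word, value) records that arrive sorted by word, mirroring the
    reducer above. Total-count lines (bare numbers) sort before the
    month-prefixed daily lines, as they do in the real shuffle.
    """
    months = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
    out = []
    prev_word, dates, counts, total = None, [], [], None
    for line in lines:
        word, value = line.strip().split('\t')
        if word != prev_word and prev_word is not None:
            # key changed: flush the previous word's buffered days
            out.extend('%s %s %s %s' % (d, prev_word, c, total)
                       for d, c in zip(dates, counts))
            dates, counts = [], []
        prev_word = word
        if value[:3] in months:
            day, cnt = value.split()
            dates.append(day)
            counts.append(cnt)
        else:
            total = value
    if prev_word is not None:  # flush the final word
        out.extend('%s %s %s %s' % (d, prev_word, c, total)
                   for d, c in zip(dates, counts))
    return out
```

Feeding it a few hand-sorted records makes it easy to confirm the join output format before submitting the job to the cluster.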
When I run this job:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/input -output /user/cloudera/output_join -mapper /home/cloudera/join1_mapper.py -reducer /home/cloudera/join1_reducer.py
I get the error:
ERROR streaming.StreamJob: Job not successful! Streaming Command Failed!
The first part of the log says:
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar] /tmp/streamjob7178107162745054499.jar tmpDir=null
15/11/13 02:03:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/11/13 02:03:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/11/13 02:03:43 INFO mapred.FileInputFormat: Total input paths to process : 4
15/11/13 02:03:43 INFO mapreduce.JobSubmitter: number of splits:5
15/11/13 02:03:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1445251653083_0013
15/11/13 02:03:44 INFO impl.YarnClientImpl: Submitted application application_1445251653083_0013
15/11/13 02:03:44 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1445251653083_0013/
15/11/13 02:03:44 INFO mapreduce.Job: Running job: job_1445251653083_0013
15/11/13 02:03:53 INFO mapreduce.Job: Job job_1445251653083_0013 running in uber mode : false
15/11/13 02:03:53 INFO mapreduce.Job: map 0% reduce 0%
15/11/13 02:04:19 INFO mapreduce.Job: map 40% reduce 0%
15/11/13 02:04:19 INFO mapreduce.Job: Task Id : attempt_1445251653083_0013_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
15/11/13 02:04:19 INFO mapreduce.Job: Task Id : attempt_1445251653083_0013_m_000003_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at
I've looked for help at the tracking URL http://quickstart.cloudera:8088/proxy/application_1445251653083_0013/, but it is not clear to me what I should do. I don't understand where the error is. Could someone please help me?
I have solved it. The HDFS input directory must contain only the TXT files used in the calculation; in my case it contained other files as well. I created a new directory, moved the TXT files into it, and ran the job again against the new HDFS input directory. Now it works.
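One way to catch this up front is to scan the files before copying them into the HDFS input directory: any non-empty line the mapper cannot split on ',' marks a file that does not belong. A small sketch (the function name is made up):

```python
def bad_lines(text):
    """Return the non-empty lines that lack a comma and would
    therefore crash the mapper with an IndexError."""
    return [l for l in text.splitlines() if l.strip() and "," not in l]
```

Running this over each candidate file and rejecting any with a non-empty result would have flagged the stray files before the job ever failed.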