Sorted Hadoop WordCount Java
I'm running Hadoop's WordCount program in Java, and my first job (getting all the words and their counts) works fine. However, I run into a problem with the second job, which is supposed to sort the words by number of occurrences. I've already read this question (Hadoop WordCount sorted by word occurrences) to learn how to set up the second job, but I don't have the same problem it describes.
My code:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class simpleWordExample {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    class Map1 extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer stringTokenizer = new StringTokenizer(line);
            while (stringTokenizer.hasMoreTokens()) {
                int number = 999;
                String word = "empty";
                if (stringTokenizer.hasMoreTokens()) {
                    String str0 = stringTokenizer.nextToken();
                    word = str0.trim();
                }
                if (stringTokenizer.hasMoreElements()) {
                    String str1 = stringTokenizer.nextToken();
                    number = Integer.parseInt(str1.trim());
                }
                context.write(new Text(word), new IntWritable(number));
            }
        }
    }

    class Reduce1 extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            for (IntWritable value : values) {
                context.write(key, new IntWritable(value.get()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job1 = new Job();
        Job job2 = new Job();

        job1.setJobName("wordCount");
        job1.setJarByClass(simpleWordExample.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        job1.setMapperClass(Map.class);
        job1.setCombinerClass(Reduce.class);
        job1.setReducerClass(Reduce.class);
        job1.setInputFormatClass(TextInputFormat.class);
        job1.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job1, new Path("file:///home/cloudera/data.txt"));
        FileOutputFormat.setOutputPath(job1, new Path("file:///home/cloudera/output"));

        job2.setJobName("WordCount1");
        job2.setJarByClass(simpleWordExample.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        job2.setMapperClass(Map1.class);
        job2.setCombinerClass(Reduce1.class);
        job2.setReducerClass(Reduce1.class);
        job2.setInputFormatClass(TextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job2, new Path("file:///home/cloudera/output/part-00000"));
        FileOutputFormat.setOutputPath(job2, new Path("file:///home/cloudera/outputFinal"));

        job1.submit();
        if (job1.waitForCompletion(true)) {
            job2.submit();
            job2.waitForCompletion(true);
        }
    }
}
And the error I get in the console:
15/05/02 09:56:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/02 09:56:37 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
15/05/02 09:56:37 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/05/02 09:56:39 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/02 09:56:39 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
15/05/02 09:56:39 INFO input.FileInputFormat: Total input paths to process : 1
15/05/02 09:56:41 INFO mapreduce.JobSubmitter: number of splits:1
15/05/02 09:56:41 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/05/02 09:56:41 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/05/02 09:56:41 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/05/02 09:56:41 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/05/02 09:56:41 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/05/02 09:56:41 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/05/02 09:56:41 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/05/02 09:56:41 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/05/02 09:56:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1998350370_0001
15/05/02 09:56:48 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/05/02 09:56:48 INFO mapreduce.Job: Running job: job_local1998350370_0001
15/05/02 09:56:48 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/05/02 09:56:48 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/05/02 09:56:48 INFO mapred.LocalJobRunner: Waiting for map tasks
15/05/02 09:56:48 INFO mapred.LocalJobRunner: Starting task: attempt_local1998350370_0001_m_000000_0
15/05/02 09:56:48 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/05/02 09:56:48 INFO mapred.MapTask: Processing split: file:/home/cloudera/data.txt:0+1528889
15/05/02 09:56:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/05/02 09:56:52 INFO mapreduce.Job: Job job_local1998350370_0001 running in uber mode : false
15/05/02 09:56:52 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/05/02 09:56:52 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/05/02 09:56:52 INFO mapred.MapTask: soft limit at 83886080
15/05/02 09:56:52 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/05/02 09:56:52 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/05/02 09:56:52 INFO mapreduce.Job: map 0% reduce 0%
15/05/02 09:56:57 INFO mapred.LocalJobRunner:
15/05/02 09:56:57 INFO mapred.MapTask: Starting flush of map output
15/05/02 09:56:57 INFO mapred.MapTask: Spilling map output
15/05/02 09:56:57 INFO mapred.MapTask: bufstart = 0; bufend = 2109573; bufvoid = 104857600
15/05/02 09:56:57 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25406616(101626464); length = 807781/6553600
15/05/02 09:56:58 INFO mapred.LocalJobRunner: map > sort
15/05/02 09:56:58 INFO mapreduce.Job: map 67% reduce 0%
15/05/02 09:56:59 INFO mapred.LocalJobRunner: Map task executor complete.
15/05/02 09:56:59 WARN mapred.LocalJobRunner: job_local1998350370_0001
java.lang.Exception: java.lang.RuntimeException: java.lang.NoSuchMethodException: simpleWordExample$Reduce.<init>()
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:401)
Caused by: java.lang.RuntimeException: java.lang.NoSuchMethodException: simpleWordExample$Reduce.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1619)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1603)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1452)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:693)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:761)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:233)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NoSuchMethodException: simpleWordExample$Reduce.<init>()
at java.lang.Class.getConstructor0(Class.java:2706)
at java.lang.Class.getDeclaredConstructor(Class.java:1985)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
... 13 more
15/05/02 09:57:00 INFO mapreduce.Job: Job job_local1998350370_0001 failed with state FAILED due to: NA
15/05/02 09:57:00 INFO mapreduce.Job: Counters: 21
File System Counters
FILE: Number of bytes read=1529039
FILE: Number of bytes written=174506
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=30292
Map output records=201946
Map output bytes=2109573
Map output materialized bytes=0
Input split bytes=93
Combine input records=0
Combine output records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=122
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=165613568
File Input Format Counters
Bytes Read=1528889
Thanks for your time and help!

GLOBAL EDIT: the code above now uses the new API throughout.
I've never used Hadoop myself, but it looks like Hadoop is trying to instantiate your mapper/reducer classes through their default no-args constructors. It throws a NoSuchMethodException because it cannot find a no-args constructor.
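This can be demonstrated without Hadoop at all. A minimal, stdlib-only sketch (the class and helper names are made up for illustration): a non-static inner class gets a hidden constructor parameter for its enclosing instance, so a reflective lookup of a no-args constructor fails, exactly as in the stack trace for `simpleWordExample$Reduce.<init>()`.

```java
public class NestedCtor {
    // Non-static inner class: javac adds a hidden constructor parameter
    // for the enclosing NestedCtor instance, so there is no true
    // no-args constructor for reflection to find.
    class Inner {}

    // Static nested class: gets a genuine no-args constructor.
    static class Nested {}

    // Returns true if the class has a no-args constructor — mimicking
    // the lookup Hadoop's ReflectionUtils.newInstance performs.
    static boolean hasNoArgCtor(Class<?> cls) {
        try {
            cls.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Inner  has no-args ctor: " + hasNoArgCtor(Inner.class));  // false
        System.out.println("Nested has no-args ctor: " + hasNoArgCtor(Nested.class)); // true
    }
}
```

This is why declaring the nested mapper/reducer classes `static` (as the asker did only for `Map`) makes the reflective instantiation succeed.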
Based on the following lines in your (pre-edit) code:
-- Map 1:
class Map1 extends MapReduceBase implements Mapper<Object, Text, IntWritable, Text>
-- From Driver
conf2.setInputFormat(TextInputFormat.class);
When you set the input format to TextInputFormat, the map key is always LongWritable and the value is Text. You are already using TextInputFormat correctly in your Map class.
This is probably caused by mixing and matching the two APIs. Hadoop has two APIs: the older mapred and the newer mapreduce. In your code you are importing from both. Try commenting out
import org.apache.hadoop.mapred.*;
and import everything you need from the new API. Hopefully that works for you.
After commenting that out, try writing your code against the new API:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
I've written the first Mapper and Reducer for you; you can do the same for the second Mapper and Reducer.
The Map1, Reduce1, and Reduce classes must be declared static.
There is also a bug in the job2 configuration:

job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(IntWritable.class);

You should change it to:

job2.setOutputKeyClass(IntWritable.class);
job2.setOutputValueClass(Text.class);
I hope this helps.