How do I read a CSV file with the Distributed Cache and ToolRunner on the command line in Hadoop
I am a beginner with Hadoop. I recently learned that you can read a file through the command line in Hadoop using ToolRunner and the distributed cache, and I would like to know how to implement this to read a CSV file. I am mainly confused about what the input directory should be, since the -files option that ToolRunner handles is supposed to add the file on the fly, right? Can anyone help?
ToolRunner / Driver
package stubs;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class AvgWordLength extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new AvgWordLength(), args);
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {

    /*
     * Validate that two arguments were passed from the command line.
     */
    if (args.length != 2) {
      System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
      System.exit(-1);
    }

    /*
     * Instantiate a Job object for your job's configuration.
     */
    Job job = Job.getInstance(getConf());

    /*
     * Specify the jar file that contains your driver, mapper, and reducer.
     * Hadoop will transfer this jar file to nodes in your cluster running
     * mapper and reducer tasks.
     */
    job.setJarByClass(AvgWordLength.class);

    /*
     * Specify an easily-decipherable name for the job. This job name will
     * appear in reports and logs.
     */
    job.setJobName("Average Word Length");

    /*
     * Specify the paths to the input and output data based on the
     * command-line arguments.
     */
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    /*
     * Specify the mapper and reducer classes.
     */
    job.setMapperClass(LetterMapper.class);
    job.setReducerClass(AverageReducer.class);

    /*
     * The input file and output files are text files, so there is no need
     * to call the setInputFormatClass and setOutputFormatClass methods.
     */

    /*
     * The mapper's output keys and values have different data types than
     * the reducer's output keys and values. Therefore, you must call the
     * setMapOutputKeyClass and setMapOutputValueClass methods.
     */
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    /*
     * Specify the job's output key and value classes.
     */
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    /*
     * Start the MapReduce job and wait for it to finish. If it finishes
     * successfully, return 0. If not, return 1.
     */
    boolean success = job.waitForCompletion(true);
    return (success ? 0 : 1);
  }
}
Mapper
package stubs;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.conf.Configuration;
/**
 * To define a map function for your MapReduce job, subclass the Mapper class
 * and override the map method. The class definition requires four type
 * parameters: the input key (LongWritable), the input value (Text), the
 * output key (Text), and the output value (IntWritable).
 */
public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  /**
   * The map method runs once for each line of text in the input file. It
   * receives a key of type LongWritable, a value of type Text, and a
   * Context object.
   */
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    /*
     * Convert the line, which is received as a Text object, to a String
     * object.
     */
    String line = value.toString();

    /*
     * The line.split(",") call splits the CSV line into its fields; the
     * genre field (index 9) holds multiple genres separated by "|", so it
     * is split again on that delimiter.
     */
    if (line.contains("Colour,")) {
      String[] word = line.split(",");
      String[] genre = word[9].split("\\|");
      for (int i = 0; i < genre.length - 1; i++) {
        context.write(new Text(genre[i]), new IntWritable(word.length));
      }
    }
  }
}
The command I used:
hadoop jar gc.jar stubs.AvgWordLength -files /home/training/training_materials/developer/data/movies_metadata.csv inputs gc_res
inputs is an empty directory. If I don't pass an input directory, the whole program just hangs.
Generic options (such as -files) go before the command (jar) and the command options (the class and your two paths, which become the args array):
hadoop -files /home/training/training_materials/developer/data/movies_metadata.csv jar gc.jar stubs.AvgWordLength inputs gc_res
But -files just copies the files into the cluster; it is not the distributed cache. In my experience, the cache is mostly used as a place to share data across MapReduce stages, not as a way to skip the hdfs put command before running the code.
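As for actually reading the CSV in the mapper: a file shipped with -files is symlinked into each task's working directory under its own name, so it can be opened with ordinary Java I/O, typically once in setup(). Below is a minimal sketch of such a setup() method for LetterMapper, assuming the -files command above; the metadata map and the column indexes are illustrative placeholders, not part of the original code:

// Extra imports this sketch needs in LetterMapper:
// import java.io.BufferedReader;
// import java.io.FileReader;
// import java.util.HashMap;
// import java.util.Map;

private final Map<String, String> metadata = new HashMap<String, String>(); // hypothetical lookup table

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  // -files symlinks movies_metadata.csv into the task's working directory,
  // so the bare file name is enough to open it.
  try (BufferedReader reader = new BufferedReader(new FileReader("movies_metadata.csv"))) {
    String line;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split(",");
      if (fields.length > 9) {
        metadata.put(fields[0], fields[9]); // illustrative: key each row by its first column
      }
    }
  }
}

Either way, the job's <input dir> argument still has to point at the real HDFS input that the mappers iterate over; the file shipped with -files is side data, not the job input.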