
How does Hadoop read an input file?

I have a CSV file to analyze with Hadoop MapReduce. I am wondering whether Hadoop will parse it line by line. If so, I want to split each line on commas to get the fields I need to analyze. Or is there a better way to parse the CSV and feed it into Hadoop? The file is 10 GB, comma delimited, and I want to use Java with Hadoop. Does the Text-typed parameter "value" in the map() method below contain each line that Map/Reduce parses in? That is what I am most confused about.

This is my code:

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    try {
       String[] tokens = value.toString().split(",");

       String crimeType = tokens[5].trim();      
       int year = Integer.parseInt(tokens[17].trim()); 

       context.write(crimeType, year);

     } catch (Exception e) {...}
 }

Yes, by default Hadoop uses a text input reader that feeds the mapper line by line from the input file. The key in the mapper is the byte offset of the line read. Be careful with CSV files, though, as a single column/field can contain a line break. You might want to look for a CSV input reader such as this one: https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
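For reference, here is a minimal driver sketch showing where that default behaviour comes from; it uses the standard Hadoop MapReduce API, and the CrimeMapper class name and the output key/value types are assumptions based on the question's code, not something from the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrimeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crime-by-year");
        job.setJarByClass(CrimeDriver.class);

        // TextInputFormat is already the default; it is set explicitly here to
        // show where the line-by-line reading behaviour comes from.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(CrimeMapper.class);   // hypothetical mapper class name
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}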

  • Does the Text-typed parameter "value" in the map() method below contain each line that Map/Reduce parses in? That is what I am most confused about.

    Yes (assuming you are using the default InputFormat, which is TextInputFormat). The process is a bit more involved, though. It is actually the RecordReader that decides how exactly the InputSplit created by the InputFormat will be sent to the mapper as records (key/value pairs). TextInputFormat uses LineRecordReader, and each entire line is treated as one record. Remember, the mapper does not process the entire InputSplit all at once. Rather, the InputSplit is sent to the mapper one record at a time to be processed.

  • I am wondering if Hadoop will parse it line by line? If yes, I want to use string split by comma to get the fields I want to analyze.

    I don't find anything wrong with your approach. This is how people usually process CSV files: read each line in as a Text value, convert it to a String, and use split(). One minor suggestion, though: convert the Java types into the appropriate Hadoop Writable types before you emit them with Context.write(), e.g. crimeType to Text and year to IntWritable, as in the sketch after this list.
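Here is a minimal sketch of the mapper with that suggestion applied; the CrimeMapper class name is an assumption (matching the driver sketch above), and the column indices 5 and 17 are kept from the question's code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrimeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text crimeType = new Text();
    private final IntWritable year = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // "value" holds one full line of the CSV; "key" is its byte offset.
            String[] tokens = value.toString().split(",");

            crimeType.set(tokens[5].trim());
            year.set(Integer.parseInt(tokens[17].trim()));

            // Emit Hadoop Writable types rather than a raw String and int.
            context.write(crimeType, year);
        } catch (Exception e) {
            // Skip malformed lines, as in the question's code.
        }
    }
}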

Is this what you need?

You can use Hadoop once you have already parsed and dealt with the CSV file. Hadoop needs key/value pairs for the map task.

So use something like the opencsv API to get the data from the file and provide it to Hadoop's mapper class as key/value pairs.
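One common way to use opencsv with MapReduce is to parse each line inside the mapper instead of calling split(","); opencsv's CSVParser copes with quoted fields that contain commas. This is only a sketch under that assumption, and the OpenCsvCrimeMapper class name and column indices are placeholders carried over from the question:

import java.io.IOException;

import com.opencsv.CSVParser;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OpenCsvCrimeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // CSVParser handles quoted fields containing commas, unlike String.split(",").
    private final CSVParser parser = new CSVParser();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            String[] fields = parser.parseLine(value.toString());
            context.write(new Text(fields[5].trim()),
                          new IntWritable(Integer.parseInt(fields[17].trim())));
        } catch (Exception e) {
            // Ignore lines that fail to parse.
        }
    }
}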

Have a look at this link for a detailed explanation.
