Hadoop 2: Empty result when using custom InputFormat
I want to use my own FileInputFormat with a custom RecordReader to read CSV data into <Long, String> pairs.
Therefore I created the class MyTextInputFormat:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MyTextInputFormat extends FileInputFormat<Long, String> {

    @Override
    public RecordReader<Long, String> getRecordReader(InputSplit input, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new MyStringRecordReader(job, (FileSplit) input);
    }

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return super.isSplitable(fs, filename);
    }
}
and the class MyStringRecordReader:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class MyStringRecordReader implements RecordReader<Long, String> {

    private LineRecordReader lineReader;
    private LongWritable lineKey;
    private Text lineValue;

    public MyStringRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
        System.out.println("constructor called");
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }

    @Override
    public Long createKey() {
        return lineKey.get();
    }

    @Override
    public String createValue() {
        System.out.println("createValue called");
        return lineValue.toString();
    }

    @Override
    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    @Override
    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }

    @Override
    public boolean next(Long key, String value) throws IOException {
        System.out.println("next called");
        // get the next line
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        key = lineKey.get();
        value = lineValue.toString();
        System.out.println(key);
        System.out.println(value);
        return true;
    }
}
In my Spark application I read the file by calling the sparkContext.hadoopFile method. But I only get an empty output from the following code:
public class AssociationRulesAnalysis {

    @SuppressWarnings("serial")
    public static void main(String[] args) {
        JavaRDD<String> inputRdd = sc.hadoopFile(inputFilePath, MyTextInputFormat.class, Long.class, String.class)
                .map(new Function<Tuple2<Long, String>, String>() {
                    @Override
                    public String call(Tuple2<Long, String> arg0) throws Exception {
                        System.out.println("map: " + arg0._2());
                        return arg0._2();
                    }
                });

        List<String> asList = inputRdd.take(10);
        for (String s : asList) {
            System.out.println(s);
        }
    }
}
I only get 10 empty lines back from the RDD. The console output, with the added prints, looks like the following:
=== APP STARTED : local-1467182320798
constructor called
createValue called
next called
0
ä1
map:
next called
8
ö2
map:
next called
13
ü3
map:
next called
18
ß4
map:
next called
23
ä5
map:
next called
28
ö6
map:
next called
33
ü7
map:
next called
38
ß8
map:
next called
43
ä9
map:
next called
48
ü10
map:
next called
54
ä11
map:
next called
60
ß12
map:
next called
12
=====================
constructor called
createValue called
next called
0
ä1
map:
next called
8
ö2
map:
next called
13
ü3
map:
next called
18
ß4
map:
next called
23
ä5
map:
next called
28
ö6
map:
next called
33
ü7
map:
next called
38
ß8
map:
next called
43
ä9
map:
next called
48
ü10
map:
Stopping...
(The RDD data is printed below the ===== output: 10 empty lines!!! The output above the ===== seems to be produced by the RDD.count call. In the next method the correct keys & values are shown!? What am I doing wrong?)
lineKey and lineValue are never assigned to the key and value passed into the overridden next method of your MyStringRecordReader. Hence it always shows the EMPTY string when you try to use your RecordReader. If you want a different key and value for a record in the file, then you need to use the key and value passed into the next method and initialize them with your computed key and value. If you do not intend to change the key/value record, then get rid of the following: every time you execute this piece of code you are overwriting the key/value read from the file with your EMPTY string and 0L.
key = lineKey.get();
value = lineValue.toString();
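The underlying issue is that Java passes object references by value: reassigning the key and value parameters inside next only rebinds local copies, so the caller never sees the new values. That is why Hadoop's old mapred API uses mutable Writable containers (LongWritable, Text) that next fills in place. A minimal stand-alone sketch of the two patterns, where MutableLong and MutableText are hypothetical stand-ins for LongWritable and Text:

```java
// Hypothetical stand-ins for Hadoop's mutable LongWritable/Text containers.
final class MutableLong {
    long value;
    void set(long v) { value = v; }
}

final class MutableText {
    String value = "";
    void set(String v) { value = v; }
}

public class RecordReaderContract {

    // Broken pattern: reassigning the parameters only rebinds local copies,
    // so the caller's key/value stay at their initial (empty) state.
    static boolean nextBroken(Long key, String value) {
        key = 42L;          // lost when the method returns
        value = "line one"; // lost when the method returns
        return true;
    }

    // Working pattern: mutate the objects the caller handed in,
    // which is what LineRecordReader does with LongWritable/Text.
    static boolean nextWorking(MutableLong key, MutableText value) {
        key.set(42L);
        value.set("line one");
        return true;
    }

    public static void main(String[] args) {
        Long k1 = 0L;
        String v1 = "";
        nextBroken(k1, v1);
        System.out.println(k1 + " '" + v1 + "'"); // still: 0 '' -- the empty result

        MutableLong k2 = new MutableLong();
        MutableText v2 = new MutableText();
        nextWorking(k2, v2);
        System.out.println(k2.value + " '" + v2.value + "'"); // 42 'line one'
    }
}
```

In practice this means declaring the reader as RecordReader<LongWritable, Text> (with createKey/createValue returning fresh writables that next mutates) and converting to Long/String afterwards, for example in the Spark map function.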