
Convert plain text file to Hadoop sequence file in Spark

My existing project uses Hadoop MapReduce to generate a sequence file with a custom key and a value in XML format.

The XML value is generated by reading one line at a time from the input source; the RecordReader is implemented to return the next value, converted to XML, from the plain text.

e.g. the input source file has 3 rows (the 1st row is the header and the remaining rows hold the actual data):

id|name|value
1|Vijay|1000
2|Gaurav|2000
3|Ashok|3000

After the map method, the sequence file has data as below:

FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>1</id><name>Vijay</name><value>1000</value></bars>
FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>2</id><name>Gaurav</name><value>2000</value></bars>
FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>3</id><name>Ashok</name><value>3000</value></bars>

Question: I wish to implement the same in Spark. Basically, read the input file and generate the key-value pairs as above.

Is there any way to reuse the existing InputFormat, and hence the RecordReader, that is used in my Hadoop mapper class?

The RecordReader is responsible for converting the plain-text row to XML and returning it as the value to the Hadoop map method, which writes it out via context.write().
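
For context, a simplified sketch of the kind of RecordReader involved is shown below; the class name XmlFeedRecordReader and the header/field handling are illustrative, not the actual implementation. It wraps Hadoop's LineRecordReader, skips the header row and turns each delimited row into the XML value.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

import java.io.IOException;

// Illustrative sketch only: wraps LineRecordReader and emits each data row as XML.
public class XmlFeedRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();
    private final Text xmlValue = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        while (lineReader.nextKeyValue()) {
            String line = lineReader.getCurrentValue().toString();
            if (line.startsWith("id|")) {   // simplified header check
                continue;
            }
            String[] fields = line.split("\\|");
            xmlValue.set("<?xml version='1.0' encoding='UTF-8'?><bars><id>" + fields[0]
                    + "</id><name>" + fields[1] + "</name><value>" + fields[2] + "</value></bars>");
            return true;
        }
        return false;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return lineReader.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
        return xmlValue;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}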

Kindly suggest.

This is covered in the Spark documentation in the External Datasets section. The important part for you is:

For other Hadoop InputFormats, you can use the JavaSparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use JavaSparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce).

Here's a simple example demonstrating how to use it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class ExampleSpark {

    public static void main(String[] args) throws Exception {
        JavaSparkContext spark = new JavaSparkContext();
        Configuration jobConf = new Configuration();

        // Read the input file with a "new API" Hadoop InputFormat;
        // each record becomes a (byte offset, line) pair.
        JavaPairRDD<LongWritable, Text> inputRDD =
                spark.newAPIHadoopFile(args[0], TextInputFormat.class, LongWritable.class, Text.class, jobConf);
        System.out.println(inputRDD.count());

        spark.stop();
        System.exit(0);
    }
}
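
Since the goal is to reuse the existing custom InputFormat, here is a sketch under the assumption that that format (called XmlFeedInputFormat below, a hypothetical name) is on the classpath, implements the new mapreduce API, and yields FeedInstanceKey/Text pairs with FeedInstanceKey implementing Writable. If the FeedInstanceKey is only built inside the existing mapper, the same logic can be applied with a mapToPair step before saving. The pairs are then written out as a sequence file with saveAsNewAPIHadoopFile:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class XmlSequenceFileSpark {

    public static void main(String[] args) throws Exception {
        JavaSparkContext spark = new JavaSparkContext();
        Configuration jobConf = new Configuration();

        // Reuse the existing InputFormat/RecordReader (hypothetical class names) so each
        // record arrives exactly as the Hadoop mapper would have received it.
        JavaPairRDD<FeedInstanceKey, Text> pairs = spark.newAPIHadoopFile(
                args[0], XmlFeedInputFormat.class, FeedInstanceKey.class, Text.class, jobConf);

        // Write the key/value pairs straight back out as a Hadoop sequence file.
        pairs.saveAsNewAPIHadoopFile(
                args[1], FeedInstanceKey.class, Text.class,
                SequenceFileOutputFormat.class, jobConf);

        spark.stop();
        System.exit(0);
    }
}

If the custom format is written against the old mapred API instead, JavaSparkContext.hadoopFile (or hadoopRDD with a JobConf) is the equivalent entry point, as the quoted documentation notes.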

You can see the Javadocs for JavaSparkContext here.
