
Convert plain text file to Hadoop sequence file in Spark

My existing project uses Hadoop MapReduce to generate a sequence file with a custom key and a value in XML format.

The XML value is generated by reading one line at a time from the input source; the RecordReader is implemented to return the next value, converted to XML, from the plain text.

e.g. the input source file has 3 rows (the 1st row is the header and the remaining rows hold the actual data):

id|name|value
1|Vijay|1000
2|Gaurav|2000
3|Ashok|3000

After the map method, the sequence file has data as below:

FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>1</id><name>Vijay</name><value>1000</value></bars>
FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>2</id><name>Gaurav</name><value>2000</value></bars>
FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>3</id><name>Ashok</name><value>3000</value></bars>

Question: I wish to implement the same in Spark. Basically, read the input file and generate the key-value pairs as above.

Is there any way to reuse the existing InputFormat, and hence the RecordReader, that is used in my Hadoop mapper class?

The RecordReader is responsible for converting the plain-text row to XML and returning it as the value to the Hadoop map method, which writes it out via context.write().
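
For context, a simplified sketch of the kind of RecordReader involved is shown below; the class name XmlFeedRecordReader and the header/field handling are illustrative, not the actual implementation. It wraps Hadoop's LineRecordReader, skips the header row and turns each delimited row into the XML value.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

import java.io.IOException;

// Illustrative sketch only: wraps LineRecordReader and emits each data row as XML.
public class XmlFeedRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();
    private final Text xmlValue = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        while (lineReader.nextKeyValue()) {
            String line = lineReader.getCurrentValue().toString();
            if (line.startsWith("id|")) {   // simplified header check
                continue;
            }
            String[] fields = line.split("\\|");
            xmlValue.set("<?xml version='1.0' encoding='UTF-8'?><bars><id>" + fields[0]
                    + "</id><name>" + fields[1] + "</name><value>" + fields[2] + "</value></bars>");
            return true;
        }
        return false;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return lineReader.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
        return xmlValue;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}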

Kindly suggest.

This is covered in the Spark documentation in the External Datasets section. The important part for you is:

For other Hadoop InputFormats, you can use the JavaSparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use JavaSparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce).

Here's a simple example demonstrating how to use it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class ExampleSpark {

    public static void main(String[] args) throws Exception {
        JavaSparkContext spark = new JavaSparkContext();
        Configuration jobConf = new Configuration();

        // Read the input file with a "new API" Hadoop InputFormat;
        // each record becomes a (byte offset, line) pair.
        JavaPairRDD<LongWritable, Text> inputRDD =
                spark.newAPIHadoopFile(args[0], TextInputFormat.class, LongWritable.class, Text.class, jobConf);
        System.out.println(inputRDD.count());

        spark.stop();
        System.exit(0);
    }
}
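
Since the goal is to reuse the existing custom InputFormat, here is a sketch under the assumption that that format (called XmlFeedInputFormat below, a hypothetical name) is on the classpath, implements the new mapreduce API, and yields FeedInstanceKey/Text pairs with FeedInstanceKey implementing Writable. If the FeedInstanceKey is only built inside the existing mapper, the same logic can be applied with a mapToPair step before saving. The pairs are then written out as a sequence file with saveAsNewAPIHadoopFile:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class XmlSequenceFileSpark {

    public static void main(String[] args) throws Exception {
        JavaSparkContext spark = new JavaSparkContext();
        Configuration jobConf = new Configuration();

        // Reuse the existing InputFormat/RecordReader (hypothetical class names) so each
        // record arrives exactly as the Hadoop mapper would have received it.
        JavaPairRDD<FeedInstanceKey, Text> pairs = spark.newAPIHadoopFile(
                args[0], XmlFeedInputFormat.class, FeedInstanceKey.class, Text.class, jobConf);

        // Write the key/value pairs straight back out as a Hadoop sequence file.
        pairs.saveAsNewAPIHadoopFile(
                args[1], FeedInstanceKey.class, Text.class,
                SequenceFileOutputFormat.class, jobConf);

        spark.stop();
        System.exit(0);
    }
}

If the custom format is written against the old mapred API instead, JavaSparkContext.hadoopFile (or hadoopRDD with a JobConf) is the equivalent entry point, as the quoted documentation notes.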

You can see the Javadocs for JavaSparkContext here.
