
How to convert .txt file to Hadoop's sequence file format

To effectively utilise map-reduce jobs in Hadoop, I need data to be stored in Hadoop's sequence file format. However, currently the data is only in flat .txt format. Can anyone suggest a way I can convert a .txt file to a sequence file?

The simplest answer is just an "identity" job that has a SequenceFile output.

It looks like this in Java:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceFile {

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJobName("Convert Text");
        job.setJarByClass(TextToSequenceFile.class);

        // the base Mapper is the identity mapper: it passes each
        // (offset, line) record through unchanged
        job.setMapperClass(Mapper.class);

        // map-only job, so no reducer is needed;
        // increase if you need sorting or a special number of files
        job.setNumReduceTasks(0);

        // TextInputFormat delivers the byte offset as key and the line as value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setInputFormatClass(TextInputFormat.class);

        TextInputFormat.addInputPath(job, new Path("/lol"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));

        // submit and wait for completion
        job.waitForCompletion(true);
    }
}

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations 5375-5384). O'Reilly Media. Kindle Edition.

public class SequenceFileWriteDemo {

    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
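
For completeness, the file can be read back with SequenceFile.Reader to verify the contents. A minimal sketch along the lines of the book's companion read demo (class name and structure assumed here, using the same pre-2.x API as the writer above):

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // instantiate key/value holders of whatever types the file declares
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.printf("%s\t%s\n", key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}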

It depends on what the format of the TXT file is. Is it one line per record? If so, you can simply use TextInputFormat, which creates one record for each line. In your mapper you can parse that line and use it whichever way you choose, as sketched below.
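
For instance, a minimal mapper sketch, assuming tab-separated records where the first field should become the key (the two-field layout is made up for illustration; adjust the split to your actual format):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses one tab-separated line per record into a (key, value) pair.
public class LineParsingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        if (fields.length == 2) {
            outKey.set(fields[0]);
            outValue.set(fields[1]);
            context.write(outKey, outValue);
        }
    }
}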

If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at a tutorial on writing custom InputFormats for more info.

You can also just create an intermediate table, LOAD DATA the csv contents straight into it, then create a second table stored as sequencefile (partitioned, clustered, etc.), and insert into it with a select from the intermediate table. You can also set options for compression, e.g.,

set hive.exec.compress.output = true;
set io.seqfile.compression.type = BLOCK;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

create table... stored as sequencefile;

insert overwrite table ... select * from ...;

The MR framework will then take care of the heavy lifting for you, saving you the trouble of having to write Java code.

Be watchful with the format specifier.

For example (note the space between % and s), System.out.printf("[% s]\t% s\t% s\n", writer.getLength(), key, value); will give us java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags =

Instead, we should use:

System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value); 

If you have Mahout installed, it has something called seqdirectory, which can do it.
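
Rough usage, assuming the mahout launcher script is on your PATH (flag names from the standard seqdirectory job; check mahout seqdirectory --help on your version):

    mahout seqdirectory -i <hdfs-input-dir> -o <hdfs-output-dir> -c UTF-8

It writes one SequenceFile entry per input file, keyed by the file name.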

If your data is not on HDFS, you need to upload it to HDFS. Two options:

i) Use hdfs dfs -put on your .txt file, and once you have it on HDFS, you can convert it to a seq file.

ii) Take the text file as input on your HDFS client box and convert it to a SeqFile using the Sequence File APIs, by creating a SequenceFile.Writer and appending (key, value) pairs to it.

If you don't care about the key, you can use the line number as the key and the complete text as the value (see the sketch below).
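
A minimal sketch of option ii), using the same SequenceFile.Writer API as above, with the line number as key and the line text as value (class name and argument layout are illustrative, not a fixed API):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TxtToSeqFile {

    public static void main(String[] args) throws IOException {
        // args[0]: local .txt file, args[1]: target sequence file path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path outPath = new Path(args[1]);
        LongWritable key = new LongWritable();
        Text value = new Text();
        BufferedReader in = null;
        SequenceFile.Writer writer = null;
        try {
            in = new BufferedReader(new FileReader(args[0]));
            writer = SequenceFile.createWriter(fs, conf, outPath, key.getClass(), value.getClass());
            String line;
            long lineNo = 1;
            while ((line = in.readLine()) != null) {
                key.set(lineNo++);   // line number as key
                value.set(line);     // complete text of the line as value
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
            IOUtils.closeStream(in);
        }
    }
}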
