Hadoop-解壓縮的zip文件

Question

我有很多zip格式的壓縮文件（以GB為單位），並且想編寫僅地圖作業來解壓縮它們。 我的映射器類看起來像

import java.util.zip.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.OutputCollector;
import java.io.*;

public class DecompressMapper extends Mapper <LongWritable, Text, LongWritable, Text>
{
    private static final int BUFFER_SIZE = 4096;

    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Context context) throws IOException
    {
        FileSplit fileSplit = (FileSplit)context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        this.unzip(fileName, new File(fileName).getParent()  + File.separator +  "/test_poc");  
    }

    public void unzip(String zipFilePath, String destDirectory) throws IOException {
        File destDir = new File(destDirectory);
        if (!destDir.exists()) {
            destDir.mkdir();
        }
        ZipInputStream zipIn = new ZipInputStream(new FileInputStream(zipFilePath));
        ZipEntry entry = zipIn.getNextEntry();
        // iterates over entries in the zip file
        while (entry != null) {
            String filePath = destDirectory + File.separator + entry.getName();
            if (!entry.isDirectory()) {
                // if the entry is a file, extracts it
                extractFile(zipIn, filePath);
            } else {
                // if the entry is a directory, make the directory
                File dir = new File(filePath);
                dir.mkdir();
            }
            zipIn.closeEntry();
            entry = zipIn.getNextEntry();
        }
        zipIn.close();
    }

    private void extractFile(ZipInputStream zipIn, String filePath) throws IOException {
        BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filePath));
        byte[] bytesIn = new byte[BUFFER_SIZE];
        int read = 0;
        while ((read = zipIn.read(bytesIn)) != -1) {
            bos.write(bytesIn, 0, read);
        }
        bos.close();
    }
}

和我的司機班

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DecompressJob extends Configured implements Tool{

    public static void main(String[] args) throws Exception
    {
      int res = ToolRunner.run(new Configuration(), new DecompressJob(),args);
      System.exit(res);
    }

    public int run(String[] args) throws Exception
    {
        Job conf = Job.getInstance(getConf());
        conf.setJobName("MapperOnly");

        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(DecompressMapper.class);
        conf.setNumReduceTasks(0);

        Path inp = new Path(args[0]);
        Path out = new Path(args[1]);

        FileInputFormat.addInputPath(conf, inp);
        FileOutputFormat.setOutputPath(conf, out);

        return conf.waitForCompletion(true) ? 0: 1;
    }
}

看來我的mapper類工作不正常。 我沒有在所需目錄中解壓縮的文件。 任何幫助表示贊賞。 謝謝...

Answer 1

上面的代碼有幾個問題

我正在將MR1 API與MR2 API結合使用。 絕對不要那樣做。
使用的Java IO功能。 Hadoop無法在其文件系統中識別Java IO功能。
路徑生成不正確。

我們在編寫map reduce程序時需要小心，因為hadoop使用完全不同的文件系統，並且在編寫代碼時必須考慮這一點，並且切勿混用MR1和MR2 API。

Answer 2

好的，沒有具體的方法可以在hadoop文件系統中解壓縮文件，但是經過長時間的研究，我想出了如何直接在hadoop文件系統中解壓縮文件的條件。前提是您需要將zip文件復制到特定位置然后運行mapreduce工作。 顯而易見，hadoop無法理解zipfile輸入格式，因此我們需要自定義Mapper和reducer，以便我們可以控制mapper發出和reducer消耗的內容。 請注意，此Mapreduce將在單個Mapper上運行，因為自定義hadoop提供的Record Reader類時，我們將禁用split方法，即使其變為false。 因此，Mapreduce將把文件名作為鍵 ，將未壓縮文件的內容作為值。 當reducer消耗掉時，我將輸出outputkey設置為null，因此只有未壓縮的內容保留在reducer中，並且reducer的數量設置為1，因此所有轉儲都在一個零件文件中。

我們都知道hadoop無法獨自處理zip文件，但是java可以借助其自己的ZipFile類進行處理，該類可以通過zipinputstrem讀取zip文件內容，並通過zipentry讀取zip條目，因此我們編寫了一個自定義的ZipInputFormat類，該類擴展了FileInputFormat。

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ZipFileInputFormat extends FileInputFormat<Text, BytesWritable> {
/** See the comments on the setLenient() method */
private static boolean isLenient = false;

/**
 * ZIP files are not splitable so they cannot be overrided so function
 * return false
 */
@Override
protected boolean isSplitable(JobContext context, Path filename) {
    return false;
}

/**
 * Create the ZipFileRecordReader to parse the file
 */
@Override
public RecordReader<Text, BytesWritable> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException,
        InterruptedException {
    return new ZipFileRecordReader();
}

/**
 * 
 * @param lenient
 */
public static void setLenient(boolean lenient) {
    isLenient = lenient;
}

public static boolean getLenient() {
    return isLenient;
}
}

請注意，RecordReader類返回ZipFileRecordReadeader，這是我們正在討論的Hadoop RecordReader類的自定義版本。現在讓我們稍微簡化一下RecordReader類

import java.io.IOException;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ZipFileRecordReader extends RecordReader<Text, BytesWritable> {
/** InputStream used to read the ZIP file from the FileSystem */
private FSDataInputStream fsin;

/** ZIP file parser/decompresser */
private ZipInputStream zip;

/** Uncompressed file name */
private Text currentKey;

/** Uncompressed file contents */
private BytesWritable currentValue;

/** Used to indicate progress */
private boolean isFinished = false;

/**
 * Initialise and open the ZIP file from the FileSystem
 */
@Override
public void initialize(InputSplit inputSplit,
        TaskAttemptContext taskAttemptContext) throws IOException,
        InterruptedException {
    FileSplit split = (FileSplit) inputSplit;
    Configuration conf = taskAttemptContext.getConfiguration();
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);

    // Open the stream
    fsin = fs.open(path);
    zip = new ZipInputStream(fsin);
}

/**
 * Each ZipEntry is decompressed and readied for the Mapper. The contents of
 * each file is held *in memory* in a BytesWritable object.
 * 
 * If the ZipFileInputFormat has been set to Lenient (not the default),
 * certain exceptions will be gracefully ignored to prevent a larger job
 * from failing.
 */
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    ZipEntry entry = null;
    try {
        entry = zip.getNextEntry();
    } catch (ZipException e) {
        if (ZipFileInputFormat.getLenient() == false)
            throw e;
    }

    // Sanity check
    if (entry == null) {
        isFinished = true;
        return false;
    }

    // Filename
    currentKey = new Text(entry.getName());

    if (currentKey.toString().endsWith(".zip")) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] temp1 = new byte[8192];
        while (true) {
            int bytesread1 = 0;
            try {
                bytesread1 = zip.read(temp1, 0, 8192);
            } catch (EOFException e) {
                if (ZipFileInputFormat.getLenient() == false)
                    throw e;
                return false;
            }
            if (bytesread1 > 0)
                bos.write(temp1, 0, bytesread1);
            else
                break;
        }

        zip.closeEntry();
        currentValue = new BytesWritable(bos.toByteArray());
        return true;

    }

    // Read the file contents
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    byte[] temp = new byte[8192];
    while (true) {
        int bytesRead = 0;
        try {
            bytesRead = zip.read(temp, 0, 8192);
        } catch (EOFException e) {
            if (ZipFileInputFormat.getLenient() == false)
                throw e;
            return false;
        }
        if (bytesRead > 0)
            bos.write(temp, 0, bytesRead);
        else
            break;
    }
    zip.closeEntry();

    // Uncompressed contents
    currentValue = new BytesWritable(bos.toByteArray());
    return true;
}

/**
 * Rather than calculating progress, we just keep it simple
 */
@Override
public float getProgress() throws IOException, InterruptedException {
    return isFinished ? 1 : 0;
}

/**
 * Returns the current key (name of the zipped file)
 */
@Override
public Text getCurrentKey() throws IOException, InterruptedException {
    return currentKey;
}

/**
 * Returns the current value (contents of the zipped file)
 */
@Override
public BytesWritable getCurrentValue() throws IOException,
        InterruptedException {
    return currentValue;
}

/**
 * Close quietly, ignoring any exceptions
 */
@Override
public void close() throws IOException {
    try {
        zip.close();
    } catch (Exception ignore) {
    }
    try {
        fsin.close();
    } catch (Exception ignore) {
    }
}
}

為了方便起見，我在源代碼中給出了一些注釋，以便您可以輕松了解如何使用緩沖存儲器讀取和寫入文件。現在讓我們將上述的Mapper類寫入類

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.BytesWritable; 
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper; 

public class MyMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Text key, BytesWritable value, Context context)
        throws IOException, InterruptedException {

    String filename = key.toString();

    // We only want to process .txt files
    if (filename.endsWith(".txt") == false)
        return;

    // Prepare the content
    String content = new String(value.getBytes(), "UTF-8");

    context.write(new Text(content), one);
}
}

讓我們快速編寫相同的Reducer

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    // context.write(key, new IntWritable(sum));
    context.write(new Text(key), null);
}
}

讓我們快速配置Mapper和Reducer的Job

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.saama.CustomisedMapperReducer.MyMapper;
import com.saama.CustomisedMapperReducer.MyReducer;
import com.saama.CustomisedMapperReducer.ZipFileInputFormat;
import com.saama.CustomisedMapperReducer.ZipFileRecordReader;

public class MyJob {

@SuppressWarnings("deprecation")
public static void main(String[] args) throws IOException,
        ClassNotFoundException, InterruptedException {
    Configuration conf = new Configuration();

    Job job = new Job(conf);
    job.setJarByClass(MyJob.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);

    job.setInputFormatClass(ZipFileInputFormat.class);
    job.setOutputKeyClass(TextOutputFormat.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    ZipFileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setNumReduceTasks(1);

    job.waitForCompletion(true);

}
}

請注意，在作業類中，我們已將InputFormatClass配置為ZipFileInputFormat類，而OutputFormatClass是TextOutPutFormat類。

Mavenize Project並讓依賴項保持原樣運行代碼，導出Jar文件並將其部署在hadoop集群上。 在CDH5.5 YARN上測試和部署。 POM文件的內容如下

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.mithun</groupId>
<artifactId>CustomisedMapperReducer</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>

<name>CustomisedMapperReducer</name>
<url>http://maven.apache.org</url>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0</version>
    </dependency>

    <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-mapper-asl</artifactId>
        <version>1.9.13</version>
    </dependency>


    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>
</project>

Hadoop-解壓縮的zip文件

問題描述

2 個解決方案

解決方案1
2 已采納 2015-09-25 18:05:27

解決方案2
2 2016-04-06 12:22:09

Hadoop-解壓縮的zip文件

問題描述

2 個解決方案

解決方案1 2 已采納 2015-09-25 18:05:27

解決方案2 2 2016-04-06 12:22:09

解決方案1
2 已采納 2015-09-25 18:05:27

解決方案2
2 2016-04-06 12:22:09