Hadoop - Decompressed zip files
I have a lot of compressed files in zip format (gigabytes in size) and want to write a map-only job to decompress them. My mapper class looks like
import java.util.zip.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.OutputCollector;
import java.io.*;
public class DecompressMapper extends Mapper <LongWritable, Text, LongWritable, Text>
{
private static final int BUFFER_SIZE = 4096;
public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Context context) throws IOException
{
FileSplit fileSplit = (FileSplit)context.getInputSplit();
String fileName = fileSplit.getPath().getName();
this.unzip(fileName, new File(fileName).getParent() + File.separator + "/test_poc");
}
public void unzip(String zipFilePath, String destDirectory) throws IOException {
File destDir = new File(destDirectory);
if (!destDir.exists()) {
destDir.mkdir();
}
ZipInputStream zipIn = new ZipInputStream(new FileInputStream(zipFilePath));
ZipEntry entry = zipIn.getNextEntry();
// iterates over entries in the zip file
while (entry != null) {
String filePath = destDirectory + File.separator + entry.getName();
if (!entry.isDirectory()) {
// if the entry is a file, extracts it
extractFile(zipIn, filePath);
} else {
// if the entry is a directory, make the directory
File dir = new File(filePath);
dir.mkdir();
}
zipIn.closeEntry();
entry = zipIn.getNextEntry();
}
zipIn.close();
}
private void extractFile(ZipInputStream zipIn, String filePath) throws IOException {
BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filePath));
byte[] bytesIn = new byte[BUFFER_SIZE];
int read = 0;
while ((read = zipIn.read(bytesIn)) != -1) {
bos.write(bytesIn, 0, read);
}
bos.close();
}
}
and my driver class
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class DecompressJob extends Configured implements Tool{
public static void main(String[] args) throws Exception
{
int res = ToolRunner.run(new Configuration(), new DecompressJob(),args);
System.exit(res);
}
public int run(String[] args) throws Exception
{
Job conf = Job.getInstance(getConf());
conf.setJobName("MapperOnly");
conf.setOutputKeyClass(LongWritable.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(DecompressMapper.class);
conf.setNumReduceTasks(0);
Path inp = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.addInputPath(conf, inp);
FileOutputFormat.setOutputPath(conf, out);
return conf.waitForCompletion(true) ? 0: 1;
}
}
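It seems my mapper class is not working correctly. I am not getting the decompressed files in the desired directory. Any help is appreciated. Thanks...
The above code has a few problems.
We need to be careful when writing MapReduce programs, because Hadoop uses a completely different file system and the code has to take that into account. Also, never mix the MR1 and MR2 APIs.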
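One of these problems can be reproduced without a cluster. In the question, `fileSplit.getPath().getName()` yields only the last path component, and `java.io.File` resolves names on the task's local disk rather than in HDFS. The following is a minimal sketch (the class name `LocalPathPitfall` and the file name `archive.zip` are hypothetical) of what the question's destination expression evaluates to for such a bare file name:

```java
import java.io.File;

public class LocalPathPitfall {

    /** Mirrors the destination expression used in the question's map() method. */
    public static String destinationFor(String fileName) {
        // getParent() returns null when the name has no parent component,
        // and string concatenation turns that null into the text "null".
        return new File(fileName).getParent() + File.separator + "/test_poc";
    }

    public static void main(String[] args) {
        String fileName = "archive.zip"; // hypothetical bare name, as Path.getName() would return it
        System.out.println("parent      = " + new File(fileName).getParent());
        System.out.println("destination = " + destinationFor(fileName));
    }
}
```

So even if the mapper ran, it would try to create a local directory whose name starts with the literal text "null". Separately, the question's `map` method takes an extra `OutputCollector` parameter, so it does not override `Mapper.map(key, value, context)` and the framework would not call it at all.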
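There is no built-in way to unzip files in the Hadoop file system, but after a lot of research I worked out how to unzip them directly in HDFS. The prerequisite is that you copy the zip files to a particular location and then run a MapReduce job. Hadoop does not understand a zip file input format, so we need to customise the Mapper and Reducer so that we control what the mapper emits and what the reducer consumes. Note that each zip file is read in full by a single mapper, because when customising the input handling provided by Hadoop we disable splitting, that is, isSplitable returns false. The MapReduce job therefore gets the file name as the key and the content of the uncompressed file as the value. The mapper emits the content as its output key and the reducer writes that key with a null value, so only the uncompressed content is kept. The number of reducers is set to 1, so everything is dumped into one part file.
We all know that Hadoop cannot handle zip files on its own, but Java can through its own java.util.zip classes: ZipInputStream reads the contents of the zip file and ZipEntry represents each entry in it. So we write a custom ZipFileInputFormat class that extends FileInputFormat.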
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class ZipFileInputFormat extends FileInputFormat<Text, BytesWritable> {
/** See the comments on the setLenient() method */
private static boolean isLenient = false;
/**
* ZIP files are not splittable, so this override always returns false
*/
@Override
protected boolean isSplitable(JobContext context, Path filename) {
return false;
}
/**
* Create the ZipFileRecordReader to parse the file
*/
@Override
public RecordReader<Text, BytesWritable> createRecordReader(
InputSplit split, TaskAttemptContext context) throws IOException,
InterruptedException {
return new ZipFileRecordReader();
}
/**
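* Controls whether certain exceptions are ignored instead of failing the job
* @param lenient true to ignore them, false (the default) to rethrow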
*/
public static void setLenient(boolean lenient) {
isLenient = lenient;
}
public static boolean getLenient() {
return isLenient;
}
}
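Note that createRecordReader returns a ZipFileRecordReader, which is the customised version of the Hadoop RecordReader class that we have been discussing. Now let's look at a somewhat simplified RecordReader class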
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class ZipFileRecordReader extends RecordReader<Text, BytesWritable> {
/** InputStream used to read the ZIP file from the FileSystem */
private FSDataInputStream fsin;
/** ZIP file parser/decompresser */
private ZipInputStream zip;
/** Uncompressed file name */
private Text currentKey;
/** Uncompressed file contents */
private BytesWritable currentValue;
/** Used to indicate progress */
private boolean isFinished = false;
/**
* Initialise and open the ZIP file from the FileSystem
*/
@Override
public void initialize(InputSplit inputSplit,
TaskAttemptContext taskAttemptContext) throws IOException,
InterruptedException {
FileSplit split = (FileSplit) inputSplit;
Configuration conf = taskAttemptContext.getConfiguration();
Path path = split.getPath();
FileSystem fs = path.getFileSystem(conf);
// Open the stream
fsin = fs.open(path);
zip = new ZipInputStream(fsin);
}
/**
* Each ZipEntry is decompressed and readied for the Mapper. The contents of
* each file is held *in memory* in a BytesWritable object.
*
* If the ZipFileInputFormat has been set to Lenient (not the default),
* certain exceptions will be gracefully ignored to prevent a larger job
* from failing.
*/
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
ZipEntry entry = null;
try {
entry = zip.getNextEntry();
} catch (ZipException e) {
if (ZipFileInputFormat.getLenient() == false)
throw e;
}
// Sanity check
if (entry == null) {
isFinished = true;
return false;
}
// Filename
currentKey = new Text(entry.getName());
// Read the file contents (nested .zip entries are read as raw bytes, like any other entry)
ByteArrayOutputStream bos = new ByteArrayOutputStream();
byte[] temp = new byte[8192];
while (true) {
int bytesRead = 0;
try {
bytesRead = zip.read(temp, 0, 8192);
} catch (EOFException e) {
if (ZipFileInputFormat.getLenient() == false)
throw e;
return false;
}
if (bytesRead > 0)
bos.write(temp, 0, bytesRead);
else
break;
}
zip.closeEntry();
// Uncompressed contents
currentValue = new BytesWritable(bos.toByteArray());
return true;
}
/**
* Rather than calculating progress, we just keep it simple
*/
@Override
public float getProgress() throws IOException, InterruptedException {
return isFinished ? 1 : 0;
}
/**
* Returns the current key (name of the zipped file)
*/
@Override
public Text getCurrentKey() throws IOException, InterruptedException {
return currentKey;
}
/**
* Returns the current value (contents of the zipped file)
*/
@Override
public BytesWritable getCurrentValue() throws IOException,
InterruptedException {
return currentValue;
}
/**
* Close quietly, ignoring any exceptions
*/
@Override
public void close() throws IOException {
try {
zip.close();
} catch (Exception ignore) {
}
try {
fsin.close();
} catch (Exception ignore) {
}
}
}
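The per-entry read loop in nextKeyValue() can be tried outside Hadoop. The following is a minimal sketch using only the JDK, not the record reader itself: the class name `ZipEntryReadSketch`, the helper `sampleZip()` and the entry names are all hypothetical, and an in-memory stream stands in for the `FSDataInputStream` opened from HDFS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryReadSketch {

    /** Reads every file entry of a zip stream into memory, keyed by entry name. */
    public static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            // getNextEntry() closes the previous entry and returns null at the end of the archive
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // this sketch skips directory entries, which carry no content
                }
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int n;
                // read() returns -1 once the current entry is exhausted
                while ((n = zip.read(buffer, 0, buffer.length)) != -1) {
                    bos.write(buffer, 0, n);
                }
                entries.put(entry.getName(), bos.toByteArray());
            }
        }
        return entries;
    }

    /** Builds a tiny two-entry zip archive in memory so the sketch needs no files on disk. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("dir/b.txt"));
            zos.write("world".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> entries = readEntries(new ByteArrayInputStream(sampleZip()));
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            System.out.println(e.getKey() + " -> " + new String(e.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

As in the record reader, each entry is held fully in memory here. For entries that are gigabytes in size, copying the stream to its destination in fixed-size chunks would probably be the safer design.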
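For convenience I have put some comments in the source code so that you can easily see how the files are read and written using a buffer. Now let's write the Mapper class for the above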
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
public class MyMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Text key, BytesWritable value, Context context)
throws IOException, InterruptedException {
String filename = key.toString();
// We only want to process .txt files
if (filename.endsWith(".txt") == false)
return;
// Prepare the content; getBytes() exposes the backing array, so decode only the valid bytes
String content = new String(value.getBytes(), 0, value.getLength(), "UTF-8");
context.write(new Text(content), one);
}
}
Let's quickly write the Reducer for the same
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
// context.write(key, new IntWritable(sum));
context.write(new Text(key), null);
}
}
Let's quickly configure the Job for the Mapper and Reducer
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.saama.CustomisedMapperReducer.MyMapper;
import com.saama.CustomisedMapperReducer.MyReducer;
import com.saama.CustomisedMapperReducer.ZipFileInputFormat;
import com.saama.CustomisedMapperReducer.ZipFileRecordReader;
public class MyJob {
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
// Job.getInstance replaces the deprecated Job constructor
Job job = Job.getInstance(conf);
job.setJarByClass(MyJob.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(ZipFileInputFormat.class);
// Use the new-API (mapreduce) TextOutputFormat as the output format
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
ZipFileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setNumReduceTasks(1);
job.waitForCompletion(true);
}
}
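Note that in the job class we have configured the InputFormatClass as our ZipFileInputFormat class and the OutputFormatClass as the TextOutputFormat class.
Mavenize the project and leave the dependencies as they are to run the code. Export the jar file and deploy it on your Hadoop cluster. Tested and deployed on CDH 5.5 YARN. The contents of the POM file are as follows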
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.mithun</groupId>
<artifactId>CustomisedMapperReducer</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>CustomisedMapperReducer</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>