
Decompress all Gzip files in a Hadoop hdfs directory

On my HDFS, I have a bunch of gzip files that I want to decompress to a normal format. Is there an API for doing this? Or how could I write a function to do this?

I don't want to use any command-line tools; instead, I want to accomplish this task by writing Java code.

You need a CompressionCodec to decompress the file. The implementation for gzip is GzipCodec. You get a CompressionInputStream via the codec and write out the result with simple IO. Something like this: say you have a file file.gz

//path of file
String uri = "/uri/to/file.gz";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path inputPath = new Path(uri);

CompressionCodecFactory factory = new CompressionCodecFactory(conf);
// the correct codec will be discovered by the extension of the file
CompressionCodec codec = factory.getCodec(inputPath);

if (codec == null) {
    System.err.println("No codec found for " + uri);
    System.exit(1);
}

// remove the .gz extension
String outputUri =
    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

InputStream is = codec.createInputStream(fs.open(inputPath));
OutputStream out = fs.create(new Path(outputUri));
IOUtils.copyBytes(is, out, conf);

// close the streams when done
IOUtils.closeStream(is);
IOUtils.closeStream(out);

UPDATE

If you need to get all the files in a directory, you should fetch the FileStatus objects, like

FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] statuses = fs.listStatus(new Path("hdfs/path/to/dir"));

Then just loop over them:

for (FileStatus status: statuses) {
    CompressionCodec codec = factory.getCodec(status.getPath());
    ...
    InputStream is = codec.createInputStream(fs.open(status.getPath()));
    ...
}
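
Putting the two pieces together, a minimal end-to-end sketch could look like the following (the directory path is the placeholder from above, and the class name DecompressAll is just illustrative):

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressAll {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        for (FileStatus status : fs.listStatus(new Path("hdfs/path/to/dir"))) {
            Path inputPath = status.getPath();
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                continue; // not a recognized compressed file, skip it
            }
            // drop the codec's extension (.gz) to build the output path
            String outputUri = CompressionCodecFactory.removeSuffix(
                    inputPath.toString(), codec.getDefaultExtension());

            InputStream is = codec.createInputStream(fs.open(inputPath));
            OutputStream out = fs.create(new Path(outputUri));
            try {
                IOUtils.copyBytes(is, out, conf);
            } finally {
                IOUtils.closeStream(is);
                IOUtils.closeStream(out);
            }
        }
    }
}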

I use an identity-map Hadoop job I wrote in Scalding to change compression / change split size / etc.

class IdentityMap(args: Args) extends ConfiguredJob(args) {
  CombineFileMultipleTextLine(args.list("in"): _*)
    .read
    .mapTo[String, String]('line -> 'line)(identity)
    .write(if (args.boolean("compress")) TsvCompressed(args("out")) else TextLine(args("out")))
}

General configuration abstract class:

abstract class ConfiguredJob(args: Args) extends Job(args) {
  override def config(implicit mode: Mode): Map[AnyRef, AnyRef] = {
    val Megabyte = 1024 * 1024
    val conf = super.config(mode)
    val splitSizeMax = args.getOrElse("splitSizeMax", "1024").toInt * Megabyte
    val splitSizeMin = args.getOrElse("splitSizeMin", "512").toInt * Megabyte
    val jobPriority = args.getOrElse("jobPriority","NORMAL")
    val maxHeap = args.getOrElse("maxHeap","512m")
    conf ++ Map("mapred.child.java.opts" -> ("-Xmx" + maxHeap),
      "mapred.output.compress" -> (if (args.boolean("compress")) "true" else "false"),
      "mapred.min.split.size" -> splitSizeMin.toString,
      "mapred.max.split.size" -> splitSizeMax.toString,
//      "mapred.output.compression.codec" -> args.getOrElse("codec", "org.apache.hadoop.io.compress.BZip2Codec"), //Does not work, has to be -D flag
      "mapred.job.priority" -> jobPriority)
  }
}
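
If you would rather stay in plain Java than Scalding, the same identity-map idea can be sketched as a map-only MapReduce job: TextInputFormat decompresses the .gz input transparently, and since output compression is off by default the lines come back out as plain text. This is only a rough sketch using the newer mapreduce API; the class names and argument handling are illustrative, not part of the original answer.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DecompressJob {

    // emit only the line; a NullWritable key keeps TextOutputFormat from printing byte offsets
    public static class IdentityMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "decompress gzip files");
        job.setJarByClass(DecompressJob.class);
        job.setMapperClass(IdentityMapper.class);
        job.setNumReduceTasks(0); // map-only: one output file per input file (gzip is not splittable)
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // directory of .gz files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // plain-text output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The advantage over the single-process loop shown earlier is that each file is decompressed by its own map task, so the work runs in parallel across the cluster.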
