
Adding support for Zip files in Hadoop

Original post: 2015-03-23 13:53:52 | Tags: hadoop, zip, hadoop-streaming, hadoop2

Hadoop supports reading .gz compressed files by default, and I want similar support for .zip files. I should be able to read the contents of zip files with the hadoop fs -text command.

I am looking for an approach where I don't have to implement an InputFormat and RecordReader for zip files. I want my jobs to be completely agnostic of the input file format: they should work whether or not the data is zipped, just as they do for .gz files.

1 Answer

I'm sorry to say that I see only two ways to do this from within Hadoop: either use a custom InputFormat and RecordReader based on ZipInputStream (which you clearly said you are not interested in), or detect .zip input files and unzip them before launching the job.
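To make the second option concrete, here is a minimal sketch of the "detect and unzip before launching" step. It assumes the input files are staged on a local filesystem before being uploaded to HDFS. The function name `prepare_inputs` and the per-archive `.d` directory naming are hypothetical choices for illustration, not part of Hadoop.

```python
import os
import zipfile


def prepare_inputs(paths, out_dir):
    """Return job input paths, with each zip archive replaced by its extracted files."""
    os.makedirs(out_dir, exist_ok=True)
    prepared = []
    for path in paths:
        # is_zipfile inspects the file contents rather than trusting the extension.
        if not zipfile.is_zipfile(path):
            prepared.append(path)  # not a zip: pass through unchanged
            continue
        # Extract into a per-archive directory so member names cannot collide.
        target = os.path.join(out_dir, os.path.basename(path) + ".d")
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # extract() returns the path of the file it created.
                prepared.append(archive.extract(info, target))
    return prepared
```

The returned list could then be uploaded and passed to the job as its input paths, so the job itself never sees a .zip file.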

Personally, I would do this outside Hadoop, converting the files to gzip (or indexed LZO if I needed splittable files) with a script before running the job, but you have most likely already thought of that...
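A conversion script of that kind might look like the sketch below. It handles only the gzip case (not LZO) and works on local files. The function name `zip_to_gzip` and the choice to flatten nested member paths with underscores are assumptions made for illustration.

```python
import gzip
import os
import shutil
import zipfile


def zip_to_gzip(zip_path, out_dir):
    """Write each file inside zip_path as its own .gz file in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Flatten nested member paths into a single file name.
            name = info.filename.replace("/", "_") + ".gz"
            gz_path = os.path.join(out_dir, name)
            # Stream member -> gzip so large files are never held fully in memory.
            with archive.open(info) as src, gzip.open(gz_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(gz_path)
    return written
```

Each zip member becomes a separate .gz file because gzip holds a single stream rather than an archive of files. Once uploaded, the .gz files should be readable by the job and by hadoop fs -text without changes, since Hadoop picks the codec from the file extension. Gzip output is not splittable, which is why indexed LZO is the alternative for large files.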

I'm also interested to see whether someone can come up with an unexpected answer.