简体   繁体   中英

Adding support for Zip files in hadoop

Hadoop by default have a support for reading .gz compressed files, I want to have similar support for .zip files. I should be able to read content of zip files by using hadoop -text command.

I am looking for an approach where I dont have to implement inputformat and recordreader for zip files. I want my jobs to be completely agnostic of the format of the input files, it should work irrespective of whether the data is zipped or unzipped. Similar to how it is for.gz files.

I'm sorry to say that I only see two ways to do this from "within" hadoop, either using a custom inputformat and recordreader based on ZipInputStream (which you clearly specified you were not interested in) or by detecting .zip input files and unzipping them before launching the job.

I would personally do this from outside hadoop, converting to gzip (or LZO indexed if I needed splittable files) via a script before running the job, but you most certainly already thought about that...

I'm also interested to see if someone can come up with an unexpected answer.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM