
Processing WARC files with Hadoop

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set things up for optimal performance. My project is currently processing WARC files which are gzipped.

Using the current InputFileFormat, the file is sent to one mapper and is not split. I understand this is the correct behavior for a compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, so that the input can be split and thus use more mappers? Would that be possible? Does having more mappers create more overhead in latency, or is it better to have one mapper? Thanks for your help.
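
For context, here is a minimal sketch of the kind of job I am running; WarcJob and WarcMapper are placeholder names, and the mapper body just counts WARC record headers to stand in for the real processing:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WarcJob {

        // Placeholder mapper: receives the decompressed lines of the WARC file.
        public static class WarcMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Count WARC record headers as a stand-in for real processing.
                if (value.toString().startsWith("WARC/")) {
                    context.write(new Text("records"), ONE);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "process-warc");
            job.setJarByClass(WarcJob.class);
            job.setInputFormatClass(TextInputFormat.class);
            // Input is a single .warc.gz file: gzip is not a splittable codec,
            // so the whole file becomes one split and runs in one mapper.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(WarcMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }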

Although WARC files are gzipped, they are splittable (cf. "Best splittable compression for Hadoop input = bz2?"), because every record is compressed as its own gzip (deflate) member. However, the record offsets must be known in advance.
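
As a minimal sketch of what "splittable if offsets are known" means in practice: given the byte offset of a record, you can seek there and decompress just that record's gzip member. The path and offset below are made up; in reality the offsets would come from an index such as a CDX file or the Common Crawl URL index.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WarcRecordAtOffset {
        public static void main(String[] args) throws IOException {
            // Hypothetical file and record offset.
            Path warc = new Path("hdfs:///data/example.warc.gz");
            long recordOffset = 123456L;

            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(warc)) {
                // Each WARC record is its own gzip member, so decompression
                // can start right at the record boundary.
                in.seek(recordOffset);
                BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(in), StandardCharsets.UTF_8));
                String line;
                // Print the WARC header block of this record (it ends with an empty line).
                while ((line = reader.readLine()) != null && !line.isEmpty()) {
                    System.out.println(line);
                }
            }
        }
    }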

But is this really necessary? The Common Crawl WARC files are all about 1 GB in size, and each one should normally be processed within at most 15 minutes. Given the overhead of launching a map task, that is a reasonable running time for a mapper. Alternatively, a mapper could also process a few WARC files (see the sketch below), but it is important that the list of input WARC files is split into enough pieces so that all nodes are running tasks. Processing a single WARC file on Hadoop would mean a lot of unnecessary overhead.
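
One way to arrange "a mapper processes a few WARC files" is to make the job input a plain text file listing the WARC paths and split that list with NLineInputFormat, so each mapper receives a handful of paths and reads those files itself. NLineInputFormat and its LINES_PER_MAP setting are standard Hadoop; WarcListJob and WarcListMapper below are illustrative placeholders that merely count record headers.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WarcListJob {

        // Each input value is one line of the list file, i.e. one WARC file path.
        public static class WarcListMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                Path warcPath = new Path(value.toString().trim());
                FileSystem fs = warcPath.getFileSystem(context.getConfiguration());
                long records = 0;
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(fs.open(warcPath)), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (line.startsWith("WARC/")) {   // count record headers as a stand-in
                            records++;
                        }
                    }
                }
                context.write(new Text(warcPath.getName()), new LongWritable(records));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // How many WARC paths (lines of the list file) each mapper gets.
            conf.setInt(NLineInputFormat.LINES_PER_MAP, 4);
            Job job = Job.getInstance(conf, "process-warc-list");
            job.setJarByClass(WarcListJob.class);
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path(args[0]));   // text file listing WARC paths
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(WarcListMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Tuning LINES_PER_MAP controls how many WARC files each mapper handles and therefore how many map tasks the job produces, so you can size it to keep all nodes busy.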
