
Processing WARC files with Hadoop

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set things up for optimal performance. My project is currently processing WARC files which are gzipped.

Using the current InputFileFormat, the file is sent to one mapper and is not split. I understand this is the correct behavior for a compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, so that the input can be split and thus use more mappers? Would that be possible? Does having more mappers create more overhead in latency, or is it better to have one mapper? Thanks for your help.
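
For context, here is a minimal sketch of the kind of job I am running; WarcJob and WarcMapper are placeholder names, and the mapper body just counts WARC record headers to stand in for the real processing:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WarcJob {

        // Placeholder mapper: receives the decompressed lines of the WARC file.
        public static class WarcMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Count WARC record headers as a stand-in for real processing.
                if (value.toString().startsWith("WARC/")) {
                    context.write(new Text("records"), ONE);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "process-warc");
            job.setJarByClass(WarcJob.class);
            job.setInputFormatClass(TextInputFormat.class);
            // Input is a single .warc.gz file: gzip is not a splittable codec,
            // so the whole file becomes one split and runs in one mapper.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(WarcMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }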

Although WARC files are gzipped, they are splittable (cf. "Best splittable compression for Hadoop input = bz2?"), because every record is compressed as its own gzip (deflate) member. However, the record offsets must be known in advance.
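
As a minimal sketch of what "splittable if offsets are known" means in practice: given the byte offset of a record, you can seek there and decompress just that record's gzip member. The path and offset below are made up; in reality the offsets would come from an index such as a CDX file or the Common Crawl URL index.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WarcRecordAtOffset {
        public static void main(String[] args) throws IOException {
            // Hypothetical file and record offset.
            Path warc = new Path("hdfs:///data/example.warc.gz");
            long recordOffset = 123456L;

            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(warc)) {
                // Each WARC record is its own gzip member, so decompression
                // can start right at the record boundary.
                in.seek(recordOffset);
                BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(in), StandardCharsets.UTF_8));
                String line;
                // Print the WARC header block of this record (it ends with an empty line).
                while ((line = reader.readLine()) != null && !line.isEmpty()) {
                    System.out.println(line);
                }
            }
        }
    }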

But is this really necessary? The Common Crawl WARC files are all about 1 GB in size, and each one should normally be processed within at most 15 minutes. Given the overhead of launching a map task, that is a reasonable running time for a mapper. Alternatively, a mapper could also process a few WARC files (see the sketch below), but it is important that the list of input WARC files is split into enough pieces so that all nodes are running tasks. Processing a single WARC file on Hadoop would mean a lot of unnecessary overhead.
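
One way to arrange "a mapper processes a few WARC files" is to make the job input a plain text file listing the WARC paths and split that list with NLineInputFormat, so each mapper receives a handful of paths and reads those files itself. NLineInputFormat and its LINES_PER_MAP setting are standard Hadoop; WarcListJob and WarcListMapper below are illustrative placeholders that merely count record headers.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WarcListJob {

        // Each input value is one line of the list file, i.e. one WARC file path.
        public static class WarcListMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                Path warcPath = new Path(value.toString().trim());
                FileSystem fs = warcPath.getFileSystem(context.getConfiguration());
                long records = 0;
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(fs.open(warcPath)), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (line.startsWith("WARC/")) {   // count record headers as a stand-in
                            records++;
                        }
                    }
                }
                context.write(new Text(warcPath.getName()), new LongWritable(records));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // How many WARC paths (lines of the list file) each mapper gets.
            conf.setInt(NLineInputFormat.LINES_PER_MAP, 4);
            Job job = Job.getInstance(conf, "process-warc-list");
            job.setJarByClass(WarcListJob.class);
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path(args[0]));   // text file listing WARC paths
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(WarcListMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Tuning LINES_PER_MAP controls how many WARC files each mapper handles and therefore how many map tasks the job produces, so you can size it to keep all nodes busy.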
