
Reading WARC Files Efficiently

I am reading a WARC file with Python's 'warc' library. The file I am currently using is around 4.50 GB. The thing is:

file = warc.open("random.warc")
html_lists = [line for line in file]

Executing these two lines takes up to 40 seconds. Since there will be 64,000 more files like this one, 40 seconds per file is not acceptable. Do you have any tips to improve performance, or any different approaches?

Edit: I found out that the BeautifulSoup operations take a lot of time, so I removed BeautifulSoup and wrote the necessary parts myself. It is 100x faster now: it takes roughly 60 seconds to read and process 4.50 GB of data. With this line of code I remove the scripts from the data:

clean = re.sub(r"<script.*?</script>", "", string=text)

And with this one I split the text and remove the stamp, which I don't need:

warc_stamp = str(soup).split(r"\r\n\r\n")
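
For reference, here is a minimal variant of those two steps that compiles the pattern once (it is reused across all 64,000 files), matches <script> blocks spanning several lines, and splits on an actual CRLF blank line rather than the raw string r"\r\n\r\n". The names and the re.DOTALL/re.IGNORECASE flags below are illustrative assumptions, not part of the original code:

import re

# Compile once and reuse for every record; re.DOTALL lets the pattern match
# <script> blocks that span several lines, re.IGNORECASE also catches <SCRIPT>.
SCRIPT_RE = re.compile(r"<script.*?</script>", re.DOTALL | re.IGNORECASE)

def clean_record(text):
    without_scripts = SCRIPT_RE.sub("", text)
    # Headers and body are separated by a blank line, i.e. a real CRLF pair.
    return without_scripts.split("\r\n\r\n", 1)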

As I said, it is faster, but 60 seconds is still not good enough in this case. Any suggestions?

"but 60 seconds are not that good in this case"

Of course, that would mean that processing all 64,000 WARC files takes about 45 days if not done in parallel (64,000 × 60 s ≈ 3.84 million seconds, i.e. roughly 44 days). But as a comparison: the Hadoop jobs that crawl the content of the WARC files, and also those that transform WARCs into WAT and WET files, each need around 600 CPU days.

WARC files are gzip-compressed because disk space and download bandwidth are usually the limiting factors. Decompression therefore defines the baseline for any optimization. For example, decompressing a 946 MB WARC file takes 21 seconds:

% time zcat CC-MAIN-20170629154125-20170629174125-00719.warc.gz >/dev/null 
real    0m21.546s
user    0m21.304s
sys     0m0.240s

Iterating over the WARC records adds only a little extra time:

% cat benchmark_warc.py
import gzip
import sys
import warc

n_records = 0

# count only the HTTP response records of the gzip-compressed WARC file
for record in warc.WARCFile(fileobj=gzip.open(sys.argv[1])):
    if record['Content-Type'] == 'application/http; msgtype=response':
        n_records += 1

print("{} records".format(n_records))

% time python benchmark_warc.py CC-MAIN-20170629154125-20170629174125-00719.warc.gz
43799 records

real    0m23.048s
user    0m22.169s
sys     0m0.878s

If processing the payload only doubles or triples the time needed for decompression anyway (I cannot imagine that you can outperform the GNU gzip implementation significantly), you are close to the optimum. If 45 days is too long, the development time is better invested in parallelizing the processing. There are already plenty of examples of how to achieve this for Common Crawl data, e.g. cc-mrjob or cc-pyspark.
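
On a single machine, even a plain multiprocessing pool over the list of WARC files already parallelizes the work. Below is a minimal sketch that reuses the per-file loop from the benchmark above; the file pattern, worker count, and the process_file name are placeholders, not part of the original answer:

import glob
import gzip
import multiprocessing

import warc

def process_file(path):
    # Same per-file loop as benchmark_warc.py; replace the counting with
    # the real per-record processing.
    n_records = 0
    for record in warc.WARCFile(fileobj=gzip.open(path)):
        if record['Content-Type'] == 'application/http; msgtype=response':
            n_records += 1
    return path, n_records

if __name__ == '__main__':
    warc_files = glob.glob("*.warc.gz")              # placeholder file list
    with multiprocessing.Pool(processes=8) as pool:  # roughly one worker per core
        for path, n in pool.imap_unordered(process_file, warc_files):
            print("{} {} records".format(path, n))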

Get the source code of that module, and check for optimization potential.

Use a profiler to identify performance bottlenecks, then focus on these for optimization.
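
With the standard library this can be as simple as the sketch below, which profiles the counting loop from the benchmark above on one representative file (the count_records name is illustrative; running benchmark_warc.py under "python -m cProfile -s cumulative" produces the same kind of report):

import cProfile
import gzip
import pstats

import warc

def count_records(path):
    # Same loop as benchmark_warc.py above.
    n = 0
    for record in warc.WARCFile(fileobj=gzip.open(path)):
        if record['Content-Type'] == 'application/http; msgtype=response':
            n += 1
    return n

# Profile one file and print the 20 most expensive calls by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
count_records("CC-MAIN-20170629154125-20170629174125-00719.warc.gz")
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)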

Rewriting Python code in Cython and compiling it to native code can make a huge difference, so that is likely worth a try.

But in any case, rather than speculating on an internet forum about how to accelerate a two-line script, you really need to work with the actual code underneath!
