
Reading WARC Files Efficiently

I am reading a WARC file with Python's 'warc' library. The current file that I am using is around 4.50 GB. The thing is:

import warc
file = warc.open("random.warc")
html_lists = [line for line in file]   # iterating yields WARC records, not text lines

Executing these 2 lines takes up to 40 seconds. Since there will be 64,000 more files like this one, 40 seconds per file is not acceptable. Do you have any tips to improve performance, or a different approach?

Edit: I found out that the BeautifulSoup operations take some time, so I removed BeautifulSoup and wrote the necessary parts myself. It is 100x faster now: it takes roughly 60 seconds to read and process the 4.50 GB of data. With this line of code I remove the scripts from the data:

# re.DOTALL lets ".*?" match <script> blocks that span multiple lines
clean = re.sub(r"<script.*?</script>", "", text, flags=re.DOTALL)

And with this one I split the text and remove the stamp which I don't need:

warc_stamp = str(soup).split(r"\r\n\r\n")

As I said, it is faster, but 60 seconds is still not that good in this case. Any suggestions?

"but 60 seconds are not that good in this case"

Of course, it would mean that processing all 64,000 WARC files takes about 45 days (64,000 files × 60 seconds ≈ 44 days) if not done in parallel. But as a comparison: the Hadoop jobs that crawl the content of the WARC files, and also those that transform WARCs into WAT and WET files, each need around 600 CPU days.

WARC files are gzip-compressed because disk space and download bandwidth are usually the limiting factors. Decompression defines the baseline for any optimization. E.g., decompressing a 946 MB WARC file takes 21 seconds:

% time zcat CC-MAIN-20170629154125-20170629174125-00719.warc.gz >/dev/null 
real    0m21.546s
user    0m21.304s
sys     0m0.240s
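
For comparison, the same baseline can be measured from Python itself. The short script below is a minimal sketch (the script name and the 1 MiB chunk size are arbitrary choices, not from the original answer); it only decompresses the file and discards the data, so it shows the Python-side decompression floor:

% cat decompress_only.py
import gzip
import sys

# decompress_only.py (name chosen for this sketch): read the gzip stream
# in 1 MiB chunks and throw the data away, so the measured time is pure
# decompression on the Python side.
with gzip.open(sys.argv[1], "rb") as f:
    while f.read(1024 * 1024):
        pass

Run it under time, just like zcat above, to compare the two.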

Iterating over the WARC records adds only a little extra time:

% cat benchmark_warc.py
import gzip
import sys
import warc

n_records = 0

# Count only the HTTP response records (the actual page captures).
for record in warc.WARCFile(fileobj=gzip.open(sys.argv[1])):
    if record['Content-Type'] == 'application/http; msgtype=response':
        n_records += 1

print("{} records".format(n_records))

% time python benchmark_warc.py CC-MAIN-20170629154125-20170629174125-00719.warc.gz
43799 records

real    0m23.048s
user    0m22.169s
sys     0m0.878s

If processing the payload only doubles or triples the time needed anyway for decompression (I cannot imagine that you can significantly outperform the GNU gzip implementation), you're close to the optimum. If 45 days is too long, the development time is better invested in parallelizing the processing. There are already plenty of examples of how to achieve this for Common Crawl data, e.g. cc-mrjob or cc-pyspark.
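
On a single machine, even a plain multiprocessing pool captures most of that benefit. The script below is only a sketch (the script name, the pool size of 8, and the process_file helper are assumptions, not part of the original answer); it runs the counting loop from benchmark_warc.py over many WARC files in parallel, assuming Python 3:

% cat parallel_warc.py
import gzip
import sys
from multiprocessing import Pool

import warc

def process_file(path):
    # Placeholder worker: count the HTTP response records in one WARC file.
    # Replace the body with the real per-file processing.
    n = 0
    for record in warc.WARCFile(fileobj=gzip.open(path)):
        if record['Content-Type'] == 'application/http; msgtype=response':
            n += 1
    return path, n

if __name__ == '__main__':
    # Usage: python parallel_warc.py file1.warc.gz file2.warc.gz ...
    with Pool(processes=8) as pool:
        for path, n in pool.imap_unordered(process_file, sys.argv[1:]):
            print("{}: {} records".format(path, n))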

Get the source code of that module and check it for optimization potential.
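
For example, the location of the installed module can be printed directly and its files inspected from there (a minimal one-liner sketch):

% python -c "import warc; print(warc.__file__)"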

Use a profiler to identify performance bottlenecks, then focus on these for optimization.
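
The standard library's cProfile is enough for a first pass, e.g. running the benchmark script above and sorting by cumulative time:

% python -m cProfile -s cumtime benchmark_warc.py CC-MAIN-20170629154125-20170629174125-00719.warc.gz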

Rewriting the Python code in Cython and compiling it to native code can make a huge difference, so that is likely worth a try.
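
As a rough sketch of that workflow (fast_parse.py is a hypothetical module holding the hot loop, not something from the original answer), the module is compiled once with a small setup script and then imported like any other module:

% cat setup.py
from setuptools import setup
from Cython.Build import cythonize

# Compile the hypothetical fast_parse.py module into a native extension.
setup(ext_modules=cythonize("fast_parse.py"))

% python setup.py build_ext --inplace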

But in any case, rather than speculating on an internet forum about how to accelerate a two-line script, you really need to work with the actual code underneath!
