
Retrieving records from a WARC file based on URL

Posted 2018-03-20 06:46:37 | Tags: python, python-3.x, warc

I have to retrieve records from a *.warc.gz file based on their Target-URI. The documentation says that this requires external CDXJ index files to be created.

I've tried opening the file with gzip.open() and calling seek(offset), but the seek operation takes quite some time (seconds).

Is there another, correct way to retrieve the records?

Edit: I'm using the warc Python library, and it doesn't seem to provide a direct f.seek() on the WARC file.

1 Answer

You should do the seek on the file before decompressing. A seek on a gzip.open() handle refers to a position in the decompressed stream, so everything before that position has to be decompressed first, which is why it is slow. Usually, WARC files are compressed record by record, and the offset and length in the CDXJ index refer to the compressed file. They let you clip out a single WARC record and then run gzip decompression on that single record only. If in doubt, better use a library. Warcio even provides a command-line tool to extract a single record by offset: warcio extract xyz.warc.gz offset .
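
To illustrate the idea, here is a minimal standard-library sketch, not the warcio implementation. It assumes the file is compressed record by record (one gzip member per record) and that the offset, and optionally the length, come from a CDXJ index you already have. The function name read_record is hypothetical.

```python
import gzip
import zlib


def read_record(path, offset, length=None):
    """Return the decompressed bytes of the gzip member starting at `offset`.

    `offset` and `length` are positions in the *compressed* file, as an
    index would record them (assumption: one gzip member per WARC record).
    """
    with open(path, "rb") as f:   # plain open(), not gzip.open()
        f.seek(offset)            # cheap: a seek on the raw compressed file
        if length is not None:
            # Clip out exactly one member and decompress only that.
            return gzip.decompress(f.read(length))

        # Length unknown: feed data until this one gzip member ends.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip framing
        chunks = []
        while not decomp.eof:
            block = f.read(64 * 1024)
            if not block:
                break             # file ended before the member was complete
            chunks.append(decomp.decompress(block))
        return b"".join(chunks)
```

The returned bytes are the raw WARC record (headers plus payload), so the WARC-Target-URI header still has to be parsed out of them. For anything beyond a quick experiment, a library such as warcio is the safer choice.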