Retrieving records from a WARC file based on URL

I have to retrieve records from a *.warc.gz file based on the Target-URI. The documentation says that this requires external CDXJ index files to be created.

I've tried opening the file with gzip.open() and calling seek(offset), but the seek operation takes quite some time (seconds).

Is there another, correct way to retrieve the records?

Edit: I'm using the warc Python library, and it doesn't seem to provide a direct f.seek() on the WARC file.

You should do the seek on the file before decompressing it. Usually, WARC files are compressed record by record, so the offset and length in the CDXJ index let you clip out a single WARC record from the compressed file and then run gzip.open() on that single record only. If in doubt, better use a library. warcio even provides a command-line tool to extract a single record by offset: warcio extract xyz.warc.gz offset
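To illustrate the seek-before-decompress idea, here is a minimal sketch using only the Python standard library. It assumes the file is compressed record by record (one gzip member per record) and that you already have the offset and length from an index. The function name read_record, the demo file, the in-memory index list (a stand-in for a real CDXJ index), and the simplified record payloads are all made up for this example.

```python
import gzip
import os
import tempfile


def read_record(path, offset, length):
    """Return the decompressed bytes of the gzip member at (offset, length)."""
    # Open the *compressed* file with plain open(), not gzip.open(),
    # so that seek() is a cheap byte-position jump.
    with open(path, "rb") as f:
        f.seek(offset)
        compressed = f.read(length)
    # Decompress only the bytes that belong to this one record.
    return gzip.decompress(compressed)


# Demo data: simplified stand-ins for WARC records (not complete, valid records).
records = [
    b"WARC/1.0\r\nWARC-Target-URI: http://example.com/a\r\n\r\nfirst",
    b"WARC/1.0\r\nWARC-Target-URI: http://example.com/b\r\n\r\nsecond",
]

# Write each record as its own gzip member and remember where it landed.
# This list plays the role of the offset/length fields of a CDXJ index.
index = []
path = os.path.join(tempfile.mkdtemp(), "demo.warc.gz")
with open(path, "wb") as out:
    for rec in records:
        member = gzip.compress(rec)
        index.append((out.tell(), len(member)))
        out.write(member)

# Fetch the second record directly, without decompressing the first one.
offset, length = index[1]
second = read_record(path, offset, length)
```

The seek is fast here because it happens on the raw bytes; gzip.open() followed by seek() has to decompress everything before the target position, which is why it took seconds in the question.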
