Extracting headers from WARC.gz file
I have searched the site extensively but could not find what I need. I have a web.warc.gz file with data in it, and I need to extract the WARC headers. I installed Tomcat and Wayback (1.6) and tried to do this with the ./warc-header script that ships with Wayback, but I keep getting an error message about the format I am using:
Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\
~/Desktop/output.csv type \r\n
USAGE: tgtWarc fieldsSrc id
tgtWarc is the path to the target WARC.gz
fieldsSrc is the path to the text of the record
make sure each line is terminated by \r\n
and that the file ends with a blank, \r\n terminiated line
id is the XXX in:
Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
of the header record... header...
Or another type of error:
Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz
~/Desktop/output.csv Content-Type
java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:
at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)
I am fairly sure the problem is the format of what I am typing on the command line, but I still can't get it right. Please help?
You can get them using the code in the GitHub project below:
https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java
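If you would rather not pull in a Java project, the headers can also be read with a few lines of standard-library Python. This is only a minimal sketch, not what the linked project does. It assumes the usual WARC record layout (a `WARC/x.y` version line, `Name: value` fields, a blank line, then `Content-Length` bytes of payload), and it does not handle folded (continuation) header lines. The function names `iter_warc_headers` and `read_warc_gz_headers` and the `WARC-Version` dictionary key are made up for this example. As a side note, the usage text quoted in the question suggests that `warc-header` writes a header record into a target WARC from a fields file rather than extracting headers, which may be why neither command worked.

```python
import gzip


def iter_warc_headers(stream):
    """Yield a dict of header fields for each WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # "WARC-Version" is not a real WARC field; it is a key invented here
        # so the version line is not lost.
        fields = {"WARC-Version": line.strip().decode("ascii")}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or end of input) ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            fields[name.strip()] = value.strip()

        # Skip the payload in chunks so large records are not held in memory.
        remaining = int(fields.get("Content-Length", "0"))
        while remaining > 0:
            chunk = stream.read(min(remaining, 65536))
            if not chunk:
                break  # truncated file
            remaining -= len(chunk)
        yield fields


def read_warc_gz_headers(path):
    """Open a .warc.gz file and yield the header fields of every record."""
    # The gzip module reads files made of several concatenated gzip members,
    # which is how .warc.gz files are normally written (one member per record).
    with gzip.open(path, "rb") as stream:
        yield from iter_warc_headers(stream)
```

To use it, loop over `read_warc_gz_headers("web.warc.gz")` (substitute your own path) and print or save whichever fields you need from each dictionary, for example `fields.get("WARC-Type")` or `fields.get("WARC-Target-URI")`.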