
Extracting headers from WARC.gz file

I have searched the site extensively but could not find what I need. I have a web.warc.gz file containing data, and I need to extract its WARC headers. I installed Tomcat and Wayback (1.6) and tried to do this with the ./warc-header script that ships with Wayback, but I keep getting an error about the format of my command:

    Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\
    ~/Desktop/output.csv type \r\n
    USAGE: tgtWarc fieldsSrc id
        tgtWarc is the path to the target WARC.gz
        fieldsSrc is the path to the text of the record
            make sure each line is terminated by \r\n
            and that the file ends with a blank, \r\n terminiated line
        id is the XXX in:
            Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
            of the header record... header...

Or a different error:

    Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz ~/Desktop/output.csv Content-Type
    java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:
        at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
        at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
        at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)
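
For reference, the usage text asks for a fieldsSrc file in which every line ends with \r\n and the file closes with a blank \r\n line, and the exception appears to report that the stream ended before that blank-line terminator. A minimal Python sketch of producing such a file follows. The field names and values are hypothetical, and the "Name: value" layout is an assumption based on the ANVL record mentioned in the stack trace, not something the usage text states.

```python
from pathlib import Path


def format_fields(fields):
    """Return 'Name: value' lines, each ending in CRLF, plus a final blank CRLF line."""
    lines = [f"{name}: {value}\r\n" for name, value in fields.items()]
    # The extra "\r\n" makes the data end with "\r\n\r\n".
    return ("".join(lines) + "\r\n").encode("utf-8")


def write_fields_file(path, fields):
    """Write the formatted fields as bytes so line endings are not translated."""
    Path(path).write_bytes(format_fields(fields))


# Hypothetical field names and values, for illustration only.
example_fields = {"software": "example-crawler", "format": "WARC File Format 1.0"}
example_bytes = format_fields(example_fields)
```

Calling write_fields_file("fields.txt", example_fields) would then give a file that could be passed as the second argument. Whether that resolves the error above is untested.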

I am fairly sure the problem is how I am writing the command line, but I still can't get it right. Can anyone help?
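
To make the goal concrete, here is a rough sketch of what "extracting the WARC headers" could look like without Wayback, using only Python's standard library. It is not the warc-header script's behaviour. It assumes each record starts with a "WARC/" version line, that header fields are single-line "Name: value" pairs, and that Content-Length is spelled exactly that way. The function names are made up for this illustration.

```python
import gzip
import io


def iter_warc_headers(stream):
    """Yield one dict of header fields per WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {"version": line.strip().decode("utf-8", "replace")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body so payload bytes are never parsed as headers.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def read_warc_gz_headers(path):
    """Open a .warc.gz file (possibly multi-member gzip) and collect all record headers."""
    with gzip.open(path, "rb") as stream:
        return list(iter_warc_headers(stream))
```

For the file in the question, read_warc_gz_headers("web.warc.gz") would return a list of dicts, one per record, from which fields such as Content-Type could be written out to CSV.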
