Extracting headers from a WARC.gz file

Question

I have searched the site a lot but could not find what I need. I have a web.warc.gz file with data in it, and I need to extract the WARC headers. I installed Tomcat and Wayback (1.6) and tried to do this with the ./warc-header script that ships with Wayback, but I keep getting an error message about the format I am using:

Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\ 
~/Desktop/output.csv type \r\n
      USAGE: tgtWarc fieldsSrc id
        tgtWarc is the path to the target WARC.gz
          fieldsSrc is the path to the text of the record
    make sure each line is terminated by \r\n
    and that the file ends with a blank, \r\n terminiated line
id is the XXX in:
    Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
    of the header record... header...

Or another type of error:

   Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz 
    ~/Desktop/output.csv Content-Type
    java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:

at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)

I am quite sure the problem is the format of what I am typing on the command line, but I still can't get it right. Can anyone help?

Answer 1

You can get the headers using the code in the following GitHub project:

https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java
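
If you only need the WARC header fields and not the full Java setup, a minimal standard-library sketch in Python may be enough. This is not the linked project's code and not part of Wayback. It assumes the file follows the usual WARC layout: a `WARC/x.y` version line, `Name: value` header lines, a blank line, then `Content-Length` bytes of payload, with each record optionally stored as its own gzip member. The function names `iter_warc_headers` and `read_warc_gz_headers` are made up for this example, and folded (continuation) header lines are not handled. Judging from the usage text and the `writeHeaderRecord` frame in the stack trace in the question, `warc-header` seems to write a header record into a WARC rather than read headers out of one, though I have not checked that against the Wayback source.

```python
import gzip


def iter_warc_headers(stream):
    """Yield (version, fields) for each record in an uncompressed WARC byte stream.

    `stream` is any binary file-like object with readline() and read().
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8", "replace")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")

        # Collect "Name: value" lines until the blank line that ends the header.
        fields = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            fields[name.strip()] = value.strip()

        # Skip the payload in chunks so large records are not held in memory.
        remaining = int(fields.get("Content-Length", 0))
        while remaining > 0:
            chunk = stream.read(min(remaining, 65536))
            if not chunk:
                break  # truncated file
            remaining -= len(chunk)

        yield version, fields


def read_warc_gz_headers(path):
    """Open a .warc.gz file and yield the header fields of every record."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as stream:
        yield from iter_warc_headers(stream)


# Example use with the path from the question:
# import os
# for version, fields in read_warc_gz_headers(os.path.expanduser("~/Desktop/WEB.WARC.gz")):
#     print(fields.get("WARC-Type"), fields.get("Content-Type"))
```

Skipping the payload by `Content-Length` rather than scanning for the next `WARC/` line avoids being fooled by payloads that happen to contain that text.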

1 2015-04-02 11:23:31
