
How to loop through WARC files using HeaderedArchiveRecord with Heritrix 3.1

I'm using the Heritrix 3.1 Java library. Just to be clear, I'm not interested in crawling, only in processing data from compressed WARC (*.warc.gz) files generated by another team. For each WWW document stored in the WARC file, I need some information from the record header, some from the HTTP headers, and the full content of the HTTP payload/body, so I think I need to use the HeaderedArchiveRecord class.

WARCReader warcReader = WARCReaderFactory.get(warcFile);
int inputSequence = -1;

ArchiveRecord record = warcReader.get();
while(record != null){
  inputSequence++;

  // Skip the 0th record, which is just the archive guff.
  if (inputSequence == 0) {
    // print some info but do not process this record
  }
  else if (! record.hasContentHeaders()) {
    // print some info but do not process this record
  }
  else  {
    HeaderedArchiveRecord hRecord = new HeaderedArchiveRecord(record);
    ArchiveRecordHeader archiveHeader = hRecord.getHeader();
    gate.Document document = makeDocumentHeritrix(archiveHeader,
       inputSequence,  hRecord);
    //...
  }
  record.close();
  record = warcReader.get();  // line 754
}

warcReader.close();

When I run this, I get an exception with this cause:

Caused by: java.io.IOException: Failed to read WARC_MAGIC
    at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:116)
    at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:90)
    at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:94)
    at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:44)
    at org.archive.io.ArchiveReader.get(ArchiveReader.java:159)
    at gate.arcomem.batch.Enrichment.makeCorpusWithHeritrix(Enrichment.java:754)

where my line 754 is as marked above. The code in my makeDocumentHeritrix(...) method used to throw a similar exception, but with Failed to find WARC_MAGIC, until I moved the line hrecord.skipHttpHeader(); to before Header[] httpHeader = record.getContentHeaders(); inside it.

I have searched the web for example code that loops through the records in a WARC file, but haven't found any. I recall that when I used Heritrix 1.14 several years ago to do something similar, I had to do some weird things to manipulate the offsets in the files. The related methods in WARCReader are now all private or protected, so I would not expect to have to do that with the newer library.

I had success with the following code:

Iterator<ArchiveRecord> archIt = WARCReaderFactory.get(new File(args[0])).iterator();
while (archIt.hasNext()) {
     handleRecord(archIt.next());
}
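
To make the record boundaries that the iterator steps over a bit more concrete, here is a rough, library-free sketch of how records are framed in a WARC file. It is not Heritrix code and makes no claim about how ArchiveReader is implemented. Each record starts with a WARC/ version line, followed by header lines, a blank line, and exactly Content-Length bytes of content. A reader that starts parsing anywhere other than a record boundary will not find the version line, which is one plausible way to end up with a WARC_MAGIC error. All names here (WarcFramingSketch, Record, readAll, httpPayload, gzipRecord, demo) are made up for illustration, and the test records are minimal: they omit mandatory fields such as WARC-Record-ID and WARC-Date.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Library-free illustration of WARC record framing. This is NOT Heritrix code.
class WarcFramingSketch {
    /** One record: its WARC header fields and its raw content block. */
    static final class Record {
        final Map<String, String> headers = new LinkedHashMap<>();
        byte[] block = new byte[0];
    }

    /** Reads one CRLF- or LF-terminated line; returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') {
                buf.write(b);
            }
        }
        return (b == -1 && buf.size() == 0) ? null : new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Parses every record from an uncompressed WARC byte stream. */
    static List<Record> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<Record> records = new ArrayList<>();
        String line;
        while ((line = readLine(data)) != null) {
            if (line.isEmpty()) {
                continue; // blank lines that close the previous record
            }
            if (!line.startsWith("WARC/")) {
                throw new IOException("Expected WARC version line, found: " + line);
            }
            Record rec = new Record();
            while ((line = readLine(data)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    rec.headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            // Consume exactly Content-Length bytes (sketch: assumes it is present and fits an int).
            rec.block = new byte[Integer.parseInt(rec.headers.get("Content-Length"))];
            data.readFully(rec.block);
            records.add(rec);
        }
        return records;
    }

    /** Returns what follows the blank line that ends the HTTP headers in a response block. */
    static String httpPayload(byte[] block) {
        String text = new String(block, StandardCharsets.ISO_8859_1);
        int split = text.indexOf("\r\n\r\n");
        return split < 0 ? "" : text.substring(split + 4);
    }

    /** Builds one gzip member holding one minimal response record (ASCII-only body). */
    static byte[] gzipRecord(String uri, String body) throws IOException {
        String record = "WARC/1.0\r\nWARC-Type: response\r\nWARC-Target-URI: " + uri
                + "\r\nContent-Length: " + body.length() + "\r\n\r\n" + body + "\r\n\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(record.getBytes(StandardCharsets.US_ASCII));
        }
        return out.toByteArray();
    }

    /** Builds a two-record archive in memory (one gzip member per record) and parses it back. */
    static List<Record> demo() {
        try {
            ByteArrayOutputStream archive = new ByteArrayOutputStream();
            archive.write(gzipRecord("http://example.com/a",
                    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nfirst"));
            archive.write(gzipRecord("http://example.com/b",
                    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"));
            return readAll(new GZIPInputStream(new ByteArrayInputStream(archive.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(r -> System.out.println(
                r.headers.get("WARC-Target-URI") + " -> " + httpPayload(r.block)));
    }
}
```

For real archives I would still use the Heritrix iterator shown above; the sketch is only meant to show why a reader has to be positioned exactly at the start of a record before it parses the next header.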
