How to loop through WARC files using HeaderedArchiveRecord with Heritrix 3.1

I'm using the Heritrix 3.1 Java library. Just to be clear, I'm not interested in crawling but only in processing data from compressed WARC (*.warc.gz) files generated by another team. For each WWW document stored in the WARC file, I need some information from the record header, some from the HTTP headers, and the full content of the HTTP payload/body, so I think I need to use the HeaderedArchiveRecord class.

WARCReader warcReader = WARCReaderFactory.get(warcFile);
int inputSequence = -1;

ArchiveRecord record = warcReader.get();
while(record != null){
  inputSequence++;

  // Skip the 0th record, which is just the archive guff.
  if (inputSequence == 0) {
    // print some info but do not process this record
  }
  else if (! record.hasContentHeaders()) {
    // print some info but do not process this record
  }
  else  {
    HeaderedArchiveRecord hRecord = new HeaderedArchiveRecord(record);
    ArchiveRecordHeader archiveHeader = hRecord.getHeader();
    gate.Document document = makeDocumentHeritrix(archiveHeader,
       inputSequence,  hRecord);
    //...
  }
  record.close();
  record = warcReader.get();  // line 754
}

warcReader.close();

When I run this, I get an exception with this cause

Caused by: java.io.IOException: Failed to read WARC_MAGIC
    at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:116)
    at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:90)
    at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:94)
    at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:44)
    at org.archive.io.ArchiveReader.get(ArchiveReader.java:159)
    at gate.arcomem.batch.Enrichment.makeCorpusWithHeritrix(Enrichment.java:754)

where line 754 is the one marked above. My makeDocumentHeritrix(...) method used to throw a similar exception, but with the message "Failed to find WARC_MAGIC", until I moved the call hrecord.skipHttpHeader(); so that it comes before Header[] httpHeader = record.getContentHeaders(); inside that method.

I have searched the web for example code that loops through the records in a WARC file, but haven't found any. I recall that when I used Heritrix 1.14 several years ago to do something similar, I had to do some weird things to manipulate the offsets in the files. The related methods in WARCReader are now all private or protected, so I would not expect to have to do that with the newer library.

I had success using the reader's iterator instead of calling get() repeatedly:

Iterator<ArchiveRecord> archIt = WARCReaderFactory.get(new File(args[0])).iterator();
while (archIt.hasNext()) {
     handleRecord(archIt.next());
}
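
For anyone wondering why the position in the stream matters so much, here is a minimal sketch that walks a tiny in-memory *.warc.gz using only the JDK. It is not Heritrix code and makes no claim about how Heritrix works internally. All the names in it (WarcLoopSketch, Rec, readAll, payloadOf, demoArchive) are made up for illustration, and the toy records are simplified: they leave out fields a real WARC record must carry, such as WARC-Record-ID and WARC-Date. It assumes each record sits in its own gzip member, which is how such files are commonly written. The point it tries to show is that every record starts with a "WARC/" version line, so the reader has to consume exactly Content-Length bytes of one record before it looks for the next one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class WarcLoopSketch {

    /** Hypothetical holder for one record: named header fields plus the raw content block. */
    public static final class Rec {
        public final Map<String, String> headers = new LinkedHashMap<>();
        public byte[] content = new byte[0];
    }

    /** Reads one line terminated by LF (CR is dropped); returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') buf.write(b);
        }
        if (b == -1 && buf.size() == 0) return null;
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Walks the decompressed stream record by record. */
    public static List<Rec> readAll(InputStream in) throws IOException {
        List<Rec> records = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            if (line.isEmpty()) continue; // blank lines that close the previous record
            if (!line.startsWith("WARC/")) {
                // The stream is not positioned at the start of a record.
                throw new IOException("Expected a WARC version line but got: " + line);
            }
            Rec rec = new Rec();
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    rec.headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            // Consume exactly Content-Length bytes so the next read starts at the next record.
            int length = Integer.parseInt(rec.headers.getOrDefault("Content-Length", "0"));
            rec.content = new byte[length];
            for (int off = 0; off < length; ) {
                int n = in.read(rec.content, off, length - off);
                if (n < 0) throw new EOFException("Truncated record");
                off += n;
            }
            records.add(rec);
        }
        return records;
    }

    /** Returns the part of a response content block that follows the HTTP header section. */
    public static String payloadOf(byte[] content) {
        String text = new String(content, StandardCharsets.ISO_8859_1);
        int split = text.indexOf("\r\n\r\n");
        return split < 0 ? "" : text.substring(split + 4);
    }

    /** Builds one gzip member holding one simplified record (toy data only). */
    static byte[] gzipMember(String type, String body) throws IOException {
        byte[] content = body.getBytes(StandardCharsets.UTF_8);
        String head = "WARC/1.0\r\nWARC-Type: " + type
                + "\r\nContent-Length: " + content.length + "\r\n\r\n";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(head.getBytes(StandardCharsets.UTF_8));
            gz.write(content);
            gz.write("\r\n\r\n".getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Two concatenated gzip members, mimicking one-record-per-member compression. */
    public static byte[] demoArchive() throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        all.write(gzipMember("warcinfo", "software: demo\r\n"));
        all.write(gzipMember("response", "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"));
        return all.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(demoArchive()))) {
            for (Rec rec : readAll(in)) {
                System.out.println(rec.headers.get("WARC-Type") + " -> " + payloadOf(rec.content));
            }
        }
    }
}
```

If the loop that consumes the content block is skipped or stops short, the next call to readLine lands in the middle of the previous record and the "WARC/" check fails. That is the same kind of symptom as the WARC_MAGIC exception in the question, though the cause inside Heritrix may differ.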
