
Java & Heritrix 3.1.x: Web Content parsing?

Original post: 2013-07-19 15:54:48 | Tags: java, web-crawler, webpage, document-classification, heritrix

Since the developer documentation for Heritrix 3.x is largely out of date (most of it pertains to Heritrix 1.x, and most of the classes have since been changed or significantly rewritten/refactored), could anyone point me to the class (or classes) that deal with the actual web page content extraction?

What I want to do is obtain the content of a web page Heritrix is about to crawl and then apply a classifier to that content (analyze structural features, etc.). I think this functionality may be distributed among the ContentExtractor class and its many subclasses, but what I'm trying to do is locate the point where I have the web page content either in its entirety or as a readable/parseable stream. Where is the content (the HTML) that Heritrix applies regular expressions to (in order to find links, certain file types, etc.)?

1 Answer

I suggest looking into a custom WriterProcessor. I wrote a custom MirrorWriter that looks at the incoming data and writes files to different locations as they come in, for later post-processing. The code for the MirrorWriter class is rather straightforward and well commented. The documentation is here: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/writer/MirrorWriterProcessor.html
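
To give a rough idea of the "write to different locations" approach, here is a minimal sketch in plain Java. It is not the MirrorWriterProcessor code and has no Heritrix dependency: ContentRouter, bucketFor, and the directory names are hypothetical, and the check inside bucketFor is only a stand-in for whatever inspection of the incoming data you actually need.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper, not part of Heritrix: picks an output directory
// based on the content and writes the body there for later post-processing.
final class ContentRouter {
    private final Path baseDir;

    ContentRouter(Path baseDir) {
        this.baseDir = baseDir;
    }

    // Crude stand-in for a real inspection step: looks at the first
    // 512 characters of the body to decide which bucket it belongs to.
    static String bucketFor(String body) {
        String head = body.substring(0, Math.min(body.length(), 512))
                .toLowerCase(Locale.ROOT);
        if (head.contains("<html") || head.contains("<!doctype html")) {
            return "html";
        }
        return "other";
    }

    // Writes the body to baseDir/<bucket>/<fileName> and returns that path.
    Path write(String fileName, String body) throws IOException {
        Path dir = baseDir.resolve(bucketFor(body));
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        Files.write(target, body.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        ContentRouter router = new ContentRouter(Files.createTempDirectory("crawl-out"));
        System.out.println(router.write("index.html", "<html><body>hi</body></html>"));
        System.out.println(router.write("data.bin", "plain bytes"));
    }
}
```

In a real crawl the body would come from whatever your custom writer receives, and the post-processing (the classifier) would then run over the per-bucket directories after the crawl.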

If you are dead set on pre-processing, you can extend org.archive.modules.extractor.ExtractorHTML and do an on-the-fly version: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/extractor/ExtractorHTML.html
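
For the on-the-fly route, the classifier itself can stay independent of Heritrix as long as it accepts the page text as a CharSequence. The sketch below is only an illustration under that assumption: PageFeatures, tagCounts, and isLinkHeavy are hypothetical names, the "link heavy" rule is a toy, and how you hand the content over from your ExtractorHTML subclass is something to check against the Javadoc above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, Heritrix-independent helper: counts opening tags in an
// HTML document as a toy example of "structural features".
final class PageFeatures {
    // Matches the name of an opening tag such as <a ...> or <div>.
    // Closing tags, comments, and doctype declarations are skipped
    // because '<' must be followed directly by a letter.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b");

    // Returns lower-cased tag names mapped to how often each was opened.
    static Map<String, Integer> tagCounts(CharSequence html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            counts.merge(m.group(1).toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    // Toy decision rule (an assumption, not a real classifier): a page
    // counts as "link heavy" when anchors outnumber paragraphs.
    static boolean isLinkHeavy(CharSequence html) {
        Map<String, Integer> counts = tagCounts(html);
        return counts.getOrDefault("a", 0) > counts.getOrDefault("p", 0);
    }

    public static void main(String[] args) {
        String page = "<html><body><p>intro</p>"
                + "<a href='x'>x</a><a href='y'>y</a></body></html>";
        System.out.println(tagCounts(page));
        System.out.println(isLinkHeavy(page));
    }
}
```

Keeping the feature extraction in its own class like this also means you can test it on saved pages without running a crawl.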