简体   繁体   中英

Java & Heritrix 3.1.x: Web Content parsing?

Since the developer documentation for Heritrix 3.x is largely out of date (most of it pertains to Heritrix 1.x, as most of the classes have been changed or code has been significantly rewritten/refactored), could anyone point me to the relevant class (or classes) of the system that deal with the actual web page content extraction?

What I want to do is obtain the content of a web page Heritrix is about to crawl and then apply a classifier to the web page's content? (analyze structural features, etc.) I think this functionality may be distributed among the ContentExtractor class and its many subclasses, but what I'm trying to do is locate the point where I have either the web page content in its entirety or in a readable/parse-able stream. Where is the content (the html) that Heritrix applies regular expressions to (in order to find links, certain file types, etc.)?

I suggest looking into a custom WriterProcessor I wrote a custom MirrorWriter that looks at the incoming data, and writes files to different locations as they come it for later post-processing. The code for the MirrorWriter class is rather straight forward and well commented. The documentation is here: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/writer/MirrorWriterProcessor.html

If you are dead set on pre-processing, you can work with extending the org.archive.modules.extractor.ExtractorHTML and do a on-the-fly version. http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/extractor/ExtractorHTML.html

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM