
Java & Heritrix 3.1.x: Web Content parsing?

Original 2013-07-19 15:54:48 9 1 java/ web-crawler/ webpage/ document-classification/ heritrix

The developer documentation for Heritrix 3.x is largely out of date: most of it pertains to Heritrix 1.x, and most of the classes have since been changed or significantly rewritten/refactored. Could anyone point me to the relevant class (or classes) of the system that deal with the actual web page content extraction?

What I want to do is obtain the content of a web page Heritrix is about to crawl and then apply a classifier to that content (analyze structural features, etc.). I think this functionality may be distributed among the ContentExtractor class and its many subclasses, but what I'm trying to locate is the point where I have the web page content either in its entirety or as a readable/parseable stream. Where is the content (the HTML) that Heritrix applies regular expressions to (in order to find links, certain file types, etc.)?

1 answer

I suggest looking into a custom WriterProcessor. I wrote a custom MirrorWriter that looks at the incoming data and writes files to different locations as they come in, for later post-processing. The code for the MirrorWriter class is rather straightforward and well commented. The documentation is here: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/writer/MirrorWriterProcessor.html
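To make the "write files to different locations" idea concrete, here is a minimal sketch using only the JDK. It is not Heritrix code: ContentRouter, labelFor, and the MIME-type rule are hypothetical names and choices made up for illustration. A real writer processor would take the file name, MIME type, and content from whatever Heritrix hands it for each fetched URI.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical helper (not part of Heritrix): files content under a per-label directory. */
final class ContentRouter {
    private final Path baseDir;

    ContentRouter(Path baseDir) {
        this.baseDir = baseDir;
    }

    /** Picks a sub-directory name from the MIME type; this rule is illustrative only. */
    static String labelFor(String mimeType) {
        if (mimeType == null) {
            return "unknown";
        }
        String m = mimeType.toLowerCase(Locale.ROOT);
        if (m.startsWith("text/html")) {
            return "html";
        }
        if (m.startsWith("image/")) {
            return "images";
        }
        return "other";
    }

    /** Writes the content as UTF-8 to baseDir/<label>/<fileName> and returns the target path. */
    Path write(String fileName, String mimeType, CharSequence content) throws IOException {
        Path dir = baseDir.resolve(labelFor(mimeType));
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        Files.write(target, content.toString().getBytes(StandardCharsets.UTF_8));
        return target;
    }
}
```

A post-processing job could then walk only the html directory and run the classifier there, which keeps the classification work out of the crawl itself.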

If you are dead set on pre-processing, you can extend org.archive.modules.extractor.ExtractorHTML and do an on-the-fly version. http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/extractor/ExtractorHTML.html
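For the on-the-fly route, the classifier itself can be kept independent of Heritrix so that a subclass only has to pass it the page text. The sketch below assumes the HTML is available as a CharSequence at the point where the extractor runs its regular expressions; check that assumption, and the exact method to override, against the ExtractorHTML Javadoc linked above. StructuralFeatures, tagCounts, and looksLikeLinkHub are hypothetical names, and the regex is a rough tag counter rather than a real HTML parser.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical feature extractor (not a Heritrix class) for a structure-based classifier. */
final class StructuralFeatures {
    // Matches opening or self-closing tags such as <div ...>, <a href=...> or <br/>.
    // Closing tags like </a> are skipped because '<' must be followed by a letter.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    /** Counts opening tags by lower-cased tag name, in order of first appearance. */
    static Map<String, Integer> tagCounts(CharSequence html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group(1).toLowerCase(Locale.ROOT);
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    /** Toy rule standing in for a real classifier: true if the page has at least minLinks anchors. */
    static boolean looksLikeLinkHub(CharSequence html, int minLinks) {
        return tagCounts(html).getOrDefault("a", 0) >= minLinks;
    }
}
```

Because tagCounts accepts any CharSequence, the same code can be exercised on a plain String in a unit test and later fed whatever character sequence the extractor subclass receives.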

How to run Grails 3.1.x Application with JAVA

Configuring JRebel 5 with Resin 3.1.x on mac

Spring 3.1.x milestone repository

No Session found for current thread (Spring 3.1.X and Hibernate 4)

Mocking/autowiring beans with Spring 3.1.x and MockMvc

BadCredentialsException when migrating from Spring 3.0.x to 3.1.x

Is it possible to define more KeyGenerator classes for cache in Spring version 3.1.x?

Spring Security 3.1.x & JSF 2.0 : “ BeanCreationException: Error creating bean with name 'org.springframework.security.filterChains' ”

How to loop through WARC files using HeaderedArchiveRecord with Heritrix 3.1

Parsing json file content in java


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint one, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 1 2013-07-22 22:12:19
