
Apache Nutch: index only article pages to Solr

I have set up Nutch 1.17 to crawl a few websites. At a high level, these sites have two types of pages. The first are category pages or home pages, which do not contain the details of any specific story but provide links and short summaries for multiple pages. The second are pages that contain a complete story in detail, i.e., articles.

My issue is how to identify whether a given page is an actual article page or a category page. I am also interested in indexing only the story pages.

I don't think Nutch offers anything for this by default. How could I achieve this behavior?

At its core, the question boils down to how to identify article/story pages versus a homepage or category page. This is usually very domain-specific and potentially depends on many factors (rewrite rules on the server side, the CMS used, etc.).

If you are fairly familiar with the domains you're crawling, perhaps you can use a regex to differentiate between the types of pages. Assuming you can use a regex (or another field present in the NutchDocument) to tell the pages apart, you should be able to use the index-jexl-filter plugin to selectively index only the article pages.
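As a rough illustration of the regex idea, here is a minimal Python sketch that classifies URLs as article or non-article pages. The host `news.example.com` and both URL shapes (a dated slug such as `/2020/07/15/some-headline` and a numeric `/story/<id>` path) are assumptions made up for this example; you would need to derive the real patterns from the sites you crawl. It only prototypes the pattern outside Nutch and is not Nutch configuration.

```python
import re

# Hypothetical URL shapes, assumed for illustration only:
#   /YYYY/MM/DD/<slug>   dated article
#   /story/<numeric id>  article addressed by ID
ARTICLE_URL = re.compile(
    r"^https?://[^/]+/(?:\d{4}/\d{2}/\d{2}/[^/]+|story/\d+)/?$"
)


def is_article(url: str) -> bool:
    """Return True if the URL matches one of the assumed article patterns."""
    return ARTICLE_URL.match(url) is not None


if __name__ == "__main__":
    samples = [
        "https://news.example.com/",
        "https://news.example.com/category/sports",
        "https://news.example.com/2020/07/15/some-headline",
        "https://news.example.com/story/12345/",
    ]
    for url in samples:
        label = "article" if is_article(url) else "other"
        print(f"{label:8} {url}")
```

Once a pattern reliably separates the two page types on a sample of crawled URLs, the same condition could be carried over to whatever filter you configure for indexing.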

I would say that normally you wouldn't want to skip the category pages (or homepage) entirely, because these types of pages are usually a good source of new links for your crawl. Excluding them only at indexing time keeps them in the crawl while leaving them out of Solr.


 