
Apache Nutch index only article pages to Solr

Original post 2020-08-25 02:25:38 1 1 solr/ web-crawler/ nutch/ web-mining/ nutch2

I have set up Nutch 1.17 to crawl a few websites. As usual, there are two types of web pages at a high level. The first are category pages or home pages, which do not contain the details of any specific story but provide links and short excerpts for multiple pages. The second are pages that contain a complete story in detail, i.e., articles.

My issue is how to tell whether a given page is an actual article page or a category page. Further, I want to index only the story pages.

I don't think Nutch offers anything for this by default. How could I achieve this behavior?

1 answer

At its core, the question boils down to how to identify article/story pages versus a homepage or category page. This is usually very domain-specific and can depend on many factors (rewrite rules on the server side, the CMS used, etc.).

If you are fairly familiar with the domains that you're crawling, perhaps you can use a regex to differentiate between the types of pages. Assuming a regex (or another field present in the NutchDocument) can tell the pages apart, you should be able to use the index-jexl-filter plugin to selectively index only the article pages.
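As a rough illustration of the regex idea, here is a minimal Java sketch that classifies URLs before you commit to a filter expression. Everything in it is an assumption for illustration: the class name `ArticleUrlClassifier`, the `example.com` URLs, and above all the URL scheme (articles living under a dated path such as `/2020/08/25/some-slug`). Your sites will almost certainly need a different pattern.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Nutch: checks whether a URL looks like an article page.
final class ArticleUrlClassifier {

    // ASSUMPTION: article URLs have the form scheme://host/YYYY/MM/DD/slug
    // (optionally with a trailing slash). Adjust this to the sites you crawl.
    private static final Pattern ARTICLE_URL =
            Pattern.compile("^https?://[^/]+/\\d{4}/\\d{2}/\\d{2}/[^/?#]+/?$");

    private ArticleUrlClassifier() {
    }

    // Returns true when the whole URL matches the assumed article pattern.
    static boolean isArticle(String url) {
        return url != null && ARTICLE_URL.matcher(url).matches();
    }

    // Keeps only the URLs that look like articles, preserving their order.
    static List<String> articlesOnly(List<String> urls) {
        return urls.stream()
                .filter(ArticleUrlClassifier::isArticle)
                .collect(Collectors.toList());
    }
}
```

Running a sample of crawled URLs through something like this is a cheap way to validate the pattern first. Once it separates the pages reliably, the same regex can be carried over into the JEXL expression that index-jexl-filter evaluates; check the plugin's documentation for the exact property name and the variables it exposes.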

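I would say that you normally wouldn't want to skip the category pages (or the homepage) entirely, because these types of pages are usually a good source of new links for your crawl. Filtering at indexing time rather than at fetch time keeps them in the crawl while leaving them out of Solr.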


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 0 2020-08-25 18:10:43
