

Apache Nutch: index only article pages to Solr

Original post: 2020-08-25 02:25:38 | Tags: solr, web-crawler, nutch, web-mining, nutch2

I have set up Nutch 1.17 to crawl a few websites. As usual, there are two types of web pages at a high level. The first are category pages or home pages, which do not contain the details of any specific story but provide links and short excerpts for multiple pages. The second are pages that contain a complete story in detail, i.e., articles.

My question is how to tell whether a given page is an actual article page or a category page. Further, I would like to index only the story pages.

I don't think Nutch offers anything for this by default. How can I achieve this behavior?

1 Answer

At its core, the question boils down to how to distinguish article/story pages from a homepage or category page. This is usually very domain-specific and can depend on many factors (rewrite rules on the server side, the CMS used, etc.).

If you are fairly familiar with the domains that you're crawling, perhaps you can use a regex to differentiate between the different types of pages. Assuming that a regex (or another field present in the NutchDocument) can tell the pages apart, you should be able to use the index-jexl-filter plugin to selectively index only the article pages.
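
As a minimal sketch of that approach, assume that article URLs on the crawled site contain a hypothetical /article/ path segment (this layout is an assumption, not something from the question). The nutch-site.xml fragment below enables the plugin and sets a JEXL expression on the page URL. The property name index.jexl.filter and the url variable should be verified against the nutch-default.xml of your Nutch version, and the plugin.includes value is only illustrative: merge index-jexl-filter into the list you already use.

```xml
<!-- nutch-site.xml (fragment); a sketch under the assumptions stated above -->
<property>
  <name>plugin.includes</name>
  <!-- Illustrative value: the important part is that index-jexl-filter is included -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor|jexl-filter)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>index.jexl.filter</name>
  <!-- Hypothetical pattern: index a page only if its URL contains /article/ -->
  <value>url =~ '.*/article/.*'</value>
</property>
```

Pages for which the expression evaluates to false should be skipped at indexing time only. The pattern has to be adapted per site, since each domain is likely to use its own URL scheme.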

I would say that normally you wouldn't want to skip the category pages (or the homepage) entirely, because these types of pages are usually a good source of new links for your crawl. An indexing filter only affects what is sent to Solr, so those pages can still be fetched and parsed for outlinks.