
Apache Nutch index only article pages to Solr

Original post: 2020-08-25 02:25:38. Tags: solr, web-crawler, nutch, web-mining, nutch2

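I have set up Nutch 1.17 to crawl a few websites. As usual, at a high level there are two types of web pages. The first are category pages or home pages, which do not contain the full details of any particular story but provide links and short text for multiple pages. The second are pages that contain the full details of a story, i.e. articles.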

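My question is how to determine whether a given page is an actual article page or a category page. In addition, I would like to index only the story pages.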

I don't think Nutch provides anything for this by default. How can I implement this behavior?

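The core of the problem comes down to how to tell article/story pages apart from home or category pages. This is usually very domain-specific and can depend on many factors (rewrite rules on the server side, the CMS in use, etc.).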

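If you are very familiar with the domains you are crawling, perhaps you can use a regular expression to differentiate between the page types. Assuming a regular expression (or another field present in the NutchDocument) can distinguish the pages, you should be able to use the index-jexl-filter plugin to selectively index only the article pages.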
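As a rough illustration of the regular-expression idea, here is a minimal Python sketch that prototypes a URL pattern outside of Nutch before carrying it into the index-jexl-filter configuration. The URL convention (a date path followed by a slug), the example.com addresses, and the helper names `is_article` and `select_for_index` are all assumptions made for illustration; real sites will need their own patterns.

```python
import re

# Hypothetical convention: article URLs end in /YYYY/MM/DD/slug.
# This is an assumption for illustration; adjust per crawled site.
ARTICLE_URL = re.compile(r"^https?://[^/]+/\d{4}/\d{2}/\d{2}/[\w-]+/?$")


def is_article(url: str) -> bool:
    """Return True when the URL matches the assumed article pattern."""
    return ARTICLE_URL.match(url) is not None


def select_for_index(urls):
    """Keep only article URLs for indexing; nothing is dropped from the crawl."""
    return [u for u in urls if is_article(u)]


if __name__ == "__main__":
    crawled = [
        "https://example.com/",
        "https://example.com/category/politics",
        "https://example.com/2020/08/25/some-story",
    ]
    print(select_for_index(crawled))
```

Testing the pattern against a sample of already crawled URLs this way makes it easier to spot false positives before the filter decides what reaches Solr.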

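I would say that you usually don't want to skip category pages (or home pages) entirely, because these pages are often a good source of new links for the crawl.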


