
How can I crawl pages without fetching video/image content in Nutch 2.1?

Original post: 2013-01-10 16:29:40 | 0 1 | solr / lucene / web-crawler / nutch

I want to crawl pages and I only need the HTML itself, skipping all images, videos, and so on. Is this possible? Thanks in advance.

1 Answer

Check the regex-urlfilter.txt file.

There you can list the file extensions you do not want to fetch or index. For example:

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
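
To see which URLs a suffix rule like the one above would reject, the pattern can be tried outside Nutch. The following is a minimal Python sketch, assuming Python's `re` module treats this particular pattern the same way as the Java regex engine Nutch uses. The helper name `is_skipped` and the sample URLs are hypothetical and only serve as illustration.

```python
import re

# Suffix pattern copied from the rule above. In regex-urlfilter.txt the
# leading "-" marks the rule as "reject"; it is not part of the regex.
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|"
    r"zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|"
    r"exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)


def is_skipped(url: str) -> bool:
    """Return True if the URL ends in one of the listed suffixes."""
    return SKIP_SUFFIXES.search(url) is not None


# Hypothetical sample URLs, for illustration only.
samples = [
    "http://example.com/index.html",
    "http://example.com/logo.png",
    "http://example.com/clip.mov",
    "http://example.com/clip.mp4",
    "http://example.com/photo.jpg?size=large",
]

for url in samples:
    print(url, "->", "skipped" if is_skipped(url) else "kept")
```

In this sketch `logo.png` and `clip.mov` are skipped, while `index.html`, `clip.mp4`, and the URL with a query string are kept: the list has no `mp4` entry, and `$` anchors the suffix to the very end of the URL. If such content should be excluded as well, the alternation would need the extra suffixes added.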

Related questions

How do I tell Nutch to crawl *through* a url without storing it?

How to crawl images in Nutch?

Crawl Image using Apache Nutch

Can I crawl with Nutch, store in Cassandra, index using Solr?

Nutch - Crawl a page for links, but don't index

Nutch not crawling page content

How to Set topN via nutch crawl SCRIPT

How to config Nutch to crawl only the URLs in seeklist? (no crawl back need)

Crawl image and their metadata using nutch and index them into solr

get the content of the page formated as it is in nutch


