
How can I crawl a page without fetching video/image content in Nutch 2.1?

Original post: 2013-01-10 16:29:40 0 1 solr/ lucene/ web-crawler/ nutch

I want to crawl a page and keep only the HTML itself, skipping all images, videos, etc. Is it possible to do this? Thanks in advance.

1 answer

Check the regex-urlfilter.txt file.

There you can list the file extensions that you don't want to crawl or index, e.g.:

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
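
To see which URLs a rule like this would exclude, here is a minimal sketch in Python. It is only an approximation for checking the pattern, not Nutch's own filter code; the function name `is_skipped` and the example.com URLs are made up for illustration. The leading `-` in the rule marks it as an exclusion, so only the regex after it is tested here.

```python
import re

# The same suffix pattern as the rule above, without the leading "-"
# (in regex-urlfilter.txt the "-" means "reject URLs matching this regex").
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|"
    r"zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|"
    r"exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)


def is_skipped(url: str) -> bool:
    """Return True if the URL ends in one of the excluded suffixes."""
    return SKIP_SUFFIXES.search(url) is not None


# Hypothetical URLs, for illustration only.
examples = [
    "http://example.com/index.html",
    "http://example.com/logo.png",
    "http://example.com/clip.mov",
    "http://example.com/clip.mp4",
    "http://example.com/logo.png?size=large",
]

for url in examples:
    print(url, "skipped" if is_skipped(url) else "kept")
```

Two things stand out when trying this. The pattern as written has no `mp4`, `avi`, or `flv`, so any video extensions you care about have to be added to the list yourself. And because of the `$` anchor, a URL with a query string after the suffix is not matched by this rule.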

