
Configure Nutch to only index specific filetypes in Solr

I am looking for a way to configure Nutch to crawl the web but only index certain types of files (XML, to be specific) into Solr. I'm pretty sure a custom plugin would do the job, probably based on the index-more code, but I'd rather not do that unless I have to. I'm also sure I could pull everything into Solr and then delete the unwanted content with Solr's API, but that is a bit hacky. Is there a way to configure Nutch to only index certain filetypes in Solr?

In Nutch you can define filters for URLs. What about filtering by file extension?

You can filter the file type according to the extension.
You can specify the extensions you want to include or exclude in regex-urlfilter.txt.

e.g. for exclusion (-):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

With + you can specify an inclusion list instead.
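
For the XML case in the question, an inclusion rule might look like the sketch below. It is not taken from the Nutch distribution. It assumes the target URLs end in .xml with no query string, and that the catch-all accept rule (+.) normally found at the end of regex-urlfilter.txt has been removed or replaced, since rules are checked in order and the first match decides.

```
# accept only URLs ending in .xml (assumed target extension)
+\.(xml|XML)$

# reject everything else (stated explicitly for clarity)
-.
```

One caveat: as far as I know, URL filters decide which URLs Nutch fetches at all, not just which documents reach Solr. With a rule like this, the HTML pages that link to the XML files would be skipped too, so the XML URLs would have to come from the seed list. If the crawl has to follow HTML pages to discover XML files, the custom indexing plugin mentioned in the question is probably still needed.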


 