
How to crawl images in Nutch 2.3 with HBase as the backend?

I want to crawl images from certain sites. So far I have tried modifying regex-urlfilter.txt.

I changed:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

To:

-\.(css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$

But it didn't work. I am surprised that I couldn't find any documentation on crawling images with Nutch 2.3. A pointer to any existing documentation would be a great help.

To fetch and store images using Nutch, you have to follow these steps:

1- Adjust the regular expression so that it no longer filters out image formats such as jpg, jpeg, tif, gif, and png (which you already did).

2- Implement a parse plugin for parsing images. For more information about Nutch extension points and writing the required plugin, see these links:

http://wiki.apache.org/nutch/AboutPlugins

http://wiki.apache.org/nutch/WritingPluginExample
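
Besides the Java class, the plugin needs a plugin.xml descriptor so that Nutch can discover it. The following is only a minimal sketch: it assumes a hypothetical class org.example.nutch.parse.image.ImageParser that implements org.apache.nutch.parse.Parser, and the jar name, provider name, and package are placeholders that are not part of Nutch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a descriptor for a hypothetical parse-image plugin. -->
<plugin id="parse-image"
        name="Image Parse Plug-in"
        version="1.0.0"
        provider-name="example.org">

  <runtime>
    <!-- Placeholder jar name; use whatever your build produces. -->
    <library name="parse-image.jar">
      <export name="*"/>
    </library>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>

  <!-- Registers the class at the Parser extension point. -->
  <extension id="org.example.nutch.parse.image"
             name="ImageParser"
             point="org.apache.nutch.parse.Parser">
    <!-- Hypothetical implementation class, not shipped with Nutch. -->
    <implementation id="org.example.nutch.parse.image.ImageParser"
                    class="org.example.nutch.parse.image.ImageParser">
      <parameter name="contentType" value="image/jpeg|image/gif|image/png"/>
    </implementation>
  </extension>
</plugin>
```

The plugin id here (parse-image) is the name that the configuration files refer to, so it has to match wherever the plugin is mentioned.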

3- Tell Nutch about the implemented plugin and use it for image file formats:

This takes two steps. First, modify conf/parse-plugins.xml and map your implemented plugin to the image MIME types:

<mimeType name="image/jpeg">
        <plugin id="parse-image" />
</mimeType>
<mimeType name="image/gif">
        <plugin id="parse-image" />
</mimeType>
<mimeType name="image/png">
        <plugin id="parse-image" />
</mimeType>
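
The stock parse-plugins.xml also has an aliases section that maps each plugin id to the id of its parser implementation, so the new plugin probably needs an entry there as well. A sketch, assuming the implementation id is the hypothetical org.example.nutch.parse.image.ImageParser (substitute the id declared in your own plugin.xml):

```xml
<aliases>
  <!-- Keep the existing alias entries; only this line is new. -->
  <!-- The extension-id below is a hypothetical placeholder. -->
  <alias name="parse-image"
         extension-id="org.example.nutch.parse.image.ImageParser" />
</aliases>
```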

Second, add the implemented plugin to nutch-site.xml so that it is loaded at Nutch runtime. You have to add the plugin's id to the plugin.includes property.
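
As an illustration, the property might look like the sketch below. The value is a regular expression over plugin ids; everything in it other than the added "image" alternative is an assumed, default-like list, so start from the value already in your own configuration rather than copying this one.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Assumed plugin list; only "image" in parse-(html|tika|image) is the addition. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika|image)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

Because the value is matched against plugin ids, parse-(html|tika|image) enables a plugin whose id is parse-image alongside the existing HTML and Tika parsers.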
