
Crawl images using Apache Nutch

Original post: 2017-12-03 11:02:26. Tags: mongodb, apache, solr, web-crawler, nutch

I installed Apache Nutch 2.3.1, Solr 6.5.1, and MongoDB 3.4.7. After crawling URLs that contain many images, neither Solr nor MongoDB contains any images or videos. I edited the regex-urlfilter.txt file in Apache Nutch and removed the suffixes related to images (.png, .jpeg, .gif, ...). I then edited suffix-urlfilter.txt and commented out jpeg, gif, and png as well.
Even after these changes, Apache Nutch does not crawl images. How can I crawl images and see them in Solr? From what I have read, I understand that I would need to write a plugin. Is my impression correct?

2 answers

Nutch supports several formats: plain text, HTML/XHTML+XML, XML, MS Office files, Adobe PDF, RSS, RTF, and MP3. Unfortunately, there is no support for any sort of image file. Apart from this, I'm curious: what do you want to index from an image file?

If I understand your question, what you want to accomplish is to extract all the metadata from the images and index only that in Solr, right?

If Nutch is not even fetching your images, then it is most likely that one of the URL filters is excluding those URLs from being fetched (check the logs). You need to describe your changes to the different files; otherwise it will be impossible to help you.
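To make the URL filter point concrete, here is a minimal Python sketch of how rules in the style of regex-urlfilter.txt decide whether a URL is kept. It assumes the usual convention that each rule is a `+` or `-` followed by a regular expression, that the first matching rule wins, and that a URL matching no rule is dropped. The rule set below is a shortened, illustrative one and not the file shipped with Nutch; `parse_rules` and `is_accepted` are hypothetical helper names, and Python's `re` stands in for Java's regex engine.

```python
import re

# Illustrative rules only (assumption): a shortened set in the style of
# regex-urlfilter.txt, not the file that ships with Nutch.
RULES_TEXT = r"""
# skip URLs ending in common image suffixes
-\.(gif|jpg|jpeg|png|bmp)$
# accept everything else
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(url, rules):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules_with_image_rule = parse_rules(RULES_TEXT)
rules_without_image_rule = parse_rules("+.")

for url in ("http://example.com/page.html", "http://example.com/logo.png"):
    print(url,
          "| with image rule:", is_accepted(url, rules_with_image_rule),
          "| without:", is_accepted(url, rules_without_image_rule))
```

If a check like this shows that your edited rules accept image URLs but Nutch still skips them, another active filter (for example the suffix filter) is the next place to look.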

Now, back to the original question: if you want to index only image URLs (along with their metadata), then you need to filter what you index into Solr. Unfortunately, Nutch 2.3 doesn't offer this feature out of the box. In Nutch 1.x you could use mimetype-filter, which lets you specify what to index into Solr/ES depending on the MIME type of the URL. My suggestion is to use Nutch 1.x unless you have a very good reason to use Nutch 2.x. Otherwise you could port the mimetype-filter plugin to 2.x, or write your own IndexingFilter that implements your own logic.
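The decision such a filter has to make is small. The following Python sketch illustrates only the logic, under stated assumptions: it is not the code of the mimetype-filter plugin, and a real Nutch IndexingFilter would be written in Java against the Nutch plugin API. The function name `should_index`, the `allowed_prefixes` parameter, and the `contentType` field are hypothetical.

```python
def should_index(mime_type, allowed_prefixes=("image/",)):
    """Return True when the MIME type starts with one of the allowed prefixes."""
    if not mime_type:
        return False  # unknown type: do not index
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    base = mime_type.split(";", 1)[0].strip().lower()
    return base.startswith(tuple(allowed_prefixes))


# Hypothetical crawled documents with the MIME type detected at parse time.
docs = [
    {"url": "http://example.com/index.html", "contentType": "text/html; charset=UTF-8"},
    {"url": "http://example.com/logo.png", "contentType": "image/png"},
    {"url": "http://example.com/photo.jpg", "contentType": "IMAGE/JPEG"},
]

indexed = [d["url"] for d in docs if should_index(d["contentType"])]
print(indexed)
```

Note that the HTML pages still have to be fetched and parsed so that the crawler can discover the image links; the filtering applies only at indexing time.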

Keep in mind that the information you'll get in Solr is limited to what Tika can extract from the image file (its metadata), which is usually not very well curated.