Crawling images and their metadata with Nutch and indexing them into Solr

Question

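I want to build a mini image-based search engine: I provide an image file and it searches Solr for similar images. I am using Nutch for the crawling part and indexing the data into Solr. I have made the following changes to the Nutch conf files:

Added image/* to mimetype-filter.txt
Removed the image extensions from suffix-urlfilter.txt so that they are not skipped

I have also added these fields to the Solr schema.xml: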

<field name="name" type="string" indexed="true" stored="true" />
<field name="iso" type="string" indexed="true" stored="true" multiValued="true" />
<field name="iso_string" type="string" indexed="true" stored="true" multiValued="true" />
<field name="aperture" type="double" indexed="true" stored="true" />
<field name="exposure" type="string" indexed="true" stored="true" />
<field name="exposure_time" type="double" indexed="true" stored="true" />
<field name="focal" type="string" indexed="true" stored="true" />
<field name="focal_35" type="string" indexed="true" stored="true" />
<dynamicField name="ignored_*" type="string" indexed="false" stored="false" multiValued="true" />

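But when I crawl, no data gets indexed into Solr. I cannot find any documentation or tutorial on this. I also went through some Stack Overflow posts about crawling images with Nutch, but I did not find them useful.

Can someone point me in the right direction? Thanks in advance.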

Answer 1

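There is no simple answer to this question; even leaving the crawling part aside, parsing images is tricky. On top of what you have already done, you first need to enable the parse-tika plugin (parse-html only handles HTML documents). Apache Tika is able to extract some metadata about images.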

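You also need to enable the mimetype-filter plugin (this means not only editing its configuration file but also enabling the plugin in your nutch-site.xml file). Once this is configured, try the bin/nutch parsechecker <URL> tool on a URL that contains some images and check whether the image URLs show up in the Outlinks section. Also run parsechecker against an image URL itself to see what metadata is being extracted. After that, run the bin/nutch indexchecker tool against both URLs, check which fields would be indexed into Solr, and create them in the schema accordingly. Keep in mind that Tika may extract different metadata for each format.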
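
As a rough sketch of the plugin step described above, the fragment below shows how parse-tika and mimetype-filter might be switched on through the plugin.includes property in conf/nutch-site.xml. Only those two plugins come from this answer; the rest of the list is an assumption modelled on a typical Nutch 1.x setup and should be merged with whatever your own installation already lists.

```xml
<!-- conf/nutch-site.xml (sketch only; merge with your existing plugin list) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-tika handles non-HTML content such as images.
         mimetype-filter applies the rules from mimetype-filter.txt at indexing time.
         Every other entry is an assumed default and may differ in your version. -->
    <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|mimetype-filter|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

With a setting along these lines in place, rerunning bin/nutch parsechecker <URL> and bin/nutch indexchecker on an image URL should show whether the new plugins are actually being picked up.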

Solution 1
0 2019-04-09 15:42:24
