Crawling images and their metadata with Nutch and indexing them into Solr

Question

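I want to build a mini image-based search engine: I supply an image file and it searches Solr for similar images. I am using Nutch for the crawling part and indexing the data into Solr. I have made the following changes to the Nutch conf files: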

Added image/* to mimetype-filter.txt
Removed the image extensions from suffix-urlfilter.txt so that image URLs are not skipped

I also added these fields to the Solr schema.xml:

<field name="name" type="string" indexed="true" stored="true" />
<field name="iso" type="string" indexed="true" stored="true" multiValued="true" />
<field name="iso_string" type="string" indexed="true" stored="true" multiValued="true" />
<field name="aperture" type="double" indexed="true" stored="true" />
<field name="exposure" type="string" indexed="true" stored="true" />
<field name="exposure_time" type="double" indexed="true" stored="true" />
<field name="focal" type="string" indexed="true" stored="true" />
<field name="focal_35" type="string" indexed="true" stored="true" />
<dynamicField name="ignored_*" type="string" indexed="false" stored="false" multiValued="true" />

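But when I crawl, no data gets indexed into Solr. I could not find any documentation or tutorial on this. I also went through some Stack Overflow posts about crawling images with Nutch, but I did not find them helpful.

Can someone point me in the right direction? Thanks in advance.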

Answer 1

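There is no simple answer to this question: parsing images is tricky even before the crawling part comes into play. In addition to what you have already done, you first need to enable the parse-tika plugin (parse-html only handles HTML documents). Apache Tika can extract some metadata from images.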

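You also need to enable the mimetype-filter plugin. Editing its configuration file is not enough; the plugin must also be enabled in nutch-site.xml. Once that is configured, run the bin/nutch parsechecker <URL> tool against a page that contains some images and check whether the image URLs appear in the Outlinks section. Then run parsechecker against one of the image URLs to see which metadata is being extracted. After that, run the bin/nutch indexchecker tool against both URLs, check which fields it would send to Solr, and create those fields in the schema accordingly. Keep in mind that Tika may extract different metadata for each image format.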
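Both plugin changes mentioned in this answer are made through the plugin.includes property in nutch-site.xml. The fragment below is a minimal sketch, assuming a Nutch 1.x setup: only parse-tika and mimetype-filter come from the answer, while the remaining plugin ids are an assumed, typical baseline that should be replaced with whatever your installation already lists.

```xml
<!-- nutch-site.xml (fragment): goes inside the <configuration> element -->
<property>
  <name>plugin.includes</name>
  <!-- The value is a regular expression matched against plugin ids.
       parse-(html|tika) enables Tika alongside the HTML parser;
       mimetype-filter enables the filter that reads mimetype-filter.txt.
       All other ids here are assumptions: keep your existing ones. -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|mimetype-filter|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Because nutch-site.xml overrides nutch-default.xml, a plugin.includes value set here replaces the default list as a whole rather than extending it, so every plugin you still need has to appear in the expression.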

0 2019-04-09 15:42:24
