Solr Data Import Handler for HTML

Question

TLDR

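How do I configure the Solr Data Import Handler so that it imports HTML the way Solr's "post" utility does?

Context

We are doing a small project in which code exports a set of pages from a wiki (Confluence) to "straight HTML" (for availability in a DR data center: the straight HTML pages will not depend on a database, etc.).

We want to index the HTML pages in Solr.

We "got it working" using the Solr-provided "post" utility: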

post -c OPERATIONS -recursive -0 -host solr $(find . -name '*.html')

This works fine... However, we want to use the Data Import Handler (DIH), i.e. replace the shell command with a single HTTP call to the DIH endpoint ('/dataimport').
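For reference, the "single HTTP call" could look roughly like the sketch below. It is only an illustration, not something from the original setup: the host "solr" and core "OPERATIONS" are taken from the post command above, while port 8983, the helper names `dih_url` and `trigger_import`, and the `clean`/`commit` parameters are assumptions. Note that `clean=true` asks DIH to clear the existing index before importing.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dih_url(host: str, core: str, command: str = "full-import", port: int = 8983) -> str:
    """Build the URL for a Data Import Handler command on one core."""
    # Assumed parameters: clean the index first, commit when done, answer in JSON.
    query = urlencode({"command": command, "clean": "true", "commit": "true", "wt": "json"})
    return f"http://{host}:{port}/solr/{core}/dataimport?{query}"


def trigger_import(host: str, core: str) -> str:
    """Send the full-import request and return Solr's raw response body."""
    with urlopen(dih_url(host, core), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `trigger_import("solr", "OPERATIONS")` would then start the import; it needs a reachable Solr instance with a '/dataimport' handler registered for that core.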

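Question

How does one configure the Tika "data config XML" file to get functionality similar to Solr's "post" command?

When I configure it with a data-config.xml, the Solr document ends up with only "id" and "_version_" fields (where id is the untokenized file name):

Correction: I originally wrote '"id" and "title" fields'.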

        "id":"database_operations_2019.html",
        "_version_":1650836000296927232},

However, when I use "bin/post", the document has these fields, including a tokenized title:

"id":"/usr/local/html/OPERATIONS_2019_1119_1500/./database_operations_2019.html",
        "stream_size":[54115],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "stream_content_type":["text/html"],
        "dc_title":["Database Operations 2019 Guidebook"],
        "content_encoding":["UTF-8"],
        "content_type_hint":["text/html; charset=UTF-8"],
        "resourcename":["/usr/local/html/OPERATIONS_2019_1119_1500/./database_operations_2019.html"],
        "title":["Database Operations 2019 Guidebook"],
        "content_type":["text/html; charset=UTF-8"],
        "_version_":1650834641083432960},

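A few points

I have tried to RTM, but I don't follow how the "fields" map to the "html body".
Parsing a directory full of HTML is a circa-1999 problem, so I don't expect many people to be interested.
I have looked at SimplePostTool.java (the implementation of bin/post)... no real answers there.

Data config XML file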

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="file" processor="FileListEntityProcessor"
            dataSource="null"
            htmlMapper="true"
            format="html"
            baseDir="/usr/local/var/www/confluence/OPERATIONS"
            fileName=".*html"
            rootEntity="false">

      <field column="file" name="id"/>

      <entity name="html" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">

        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>

        <field column="text" name="text"/>

      </entity>

    </entity>
  </document>
</dataConfig>

Answer 1

I ended up writing a few lines of code to parse the HTML files (with jsoup) and abandoned the Solr Data Import Handler (DIH).

It was quite simple using Spring, Solr, and the jsoup HTML parser.

One caveat: my Java "bean" object that stores the Solr fields needed a "text" field for the out-of-the-box default search field to work (i.e. with the Solr Docker instance).
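The answer's own code (Java with Spring and jsoup) is not shown, so the following is only a rough analogue of the same idea, not the author's implementation. It swaps jsoup for Python's standard-library `html.parser`, and every name in it (`TitleAndText`, `html_to_doc`, `index_directory`, the update URL) is hypothetical. Each document gets "id", "title", and "text" keys, the last one matching the caveat above.

```python
import json
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import Request, urlopen


class TitleAndText(HTMLParser):
    """Collect the <title> text and the visible text of an HTML page."""

    SKIPPED = {"script", "style"}  # their contents are not page text

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def html_to_doc(doc_id: str, html: str) -> dict:
    """Turn one HTML page into a dict with id, title, and text fields."""
    parser = TitleAndText()
    parser.feed(html)
    parser.close()
    return {
        "id": doc_id,
        "title": "".join(parser.title_parts).strip(),
        "text": " ".join(parser.text_parts),
    }


def index_directory(base_dir: str, update_url: str) -> int:
    """Parse every *.html file under base_dir and POST the docs as JSON."""
    docs = [html_to_doc(path.name, path.read_text(encoding="utf-8"))
            for path in sorted(Path(base_dir).rglob("*.html"))]
    request = Request(update_url, data=json.dumps(docs).encode("utf-8"),
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=60) as response:
        return response.status
```

Under these assumptions, something like `index_directory("/usr/local/var/www/confluence/OPERATIONS", "http://solr:8983/solr/OPERATIONS/update?commit=true")` would send all pages in one request. Whether "title" and "text" are searchable still depends on the core's schema.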


Solution 1
0 2019-12-05 21:00:33
