ElasticSearch 5.0.0 ingest attachment plugin: problem indexing a PDF

Question

My environment:

{
  "name" : "node-0",
  "cluster_name" : "ES500-JBD-0",
  "cluster_uuid" : "q_akJRkrSI-glTwT5vfH4A",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}

Index and pipeline creation (Edit 3):

curl -XPUT 'vm01.jbdata.fr:9200/_ingest/pipeline/attachment' -d '{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}'

Index creation with a French mapping (Edit 4):

curl -XPUT 'vm01.jbdata.fr:9200/ged-idx-00' -d '{
  "mappings" : {
    "ged_type_0" : {
      "properties" : {
         "attachment.data" : {
            "type": "text",
            "analyzer" : "french"
            }
         }
      }
   }
}'

ES-specific configuration (Edit 1 and Edit 2):

$ bin/elasticsearch-plugin list
ingest-attachment

From config/elasticsearch.yml:

plugin.mandatory: ingest-attachment

Commands to index a PDF:

1/ A "raw" PDF.

curl -H 'Content-Type: application/pdf' -XPUT vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment -d @/tmp/zookeeperAdmin.pdf

{"error":{"root_cause":[{"type":"settings_exception","reason":"Failed to load settings from [%PDF-1.4%.... ... 0D33957F>]>>startxref76764%%EOF; line: 1, column: 2]"}},"status":500}

2/ A "B64ed" PDF.

aPath='/tmp/zookeeperAdmin.pdf'
aB64content=$(base64 $aPath | perl -pe 's/\n/\\n/g')
echo $aB64content > /tmp/zookeeperAdmin.pdf.b64
curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "file" : "content" : "'$aB64content'"
}'

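{"error":{"root_cause":..."reason":"failed to parse source for create index","caused_by":{"type":"json_parse_exception","reason":"Unexpected character (':' (code 58)): was expecting comma to separate Object entries\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@65a254b6; line: 2, column: 25]"}},"status":400}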

How do I correctly use the ingest-attachment plugin to index a PDF?

Answer 1

In my experience, the file needs to be Base64-encoded, so your option 2 is the right approach.

Regarding your last attempt:

curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "file" : "content" : "'$aB64content'"
}'

The JSON you provided is malformed ("a" : "b" : "c"), hence the error.
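One way to catch this kind of mistake before sending the request is to run the body through a JSON parser first. A minimal sketch, assuming python3 is installed (its standard json.tool module is used here only as a validator); is_valid_json is a hypothetical helper name, not something Elasticsearch or curl provides:

```shell
# is_valid_json: hypothetical helper; succeeds only if stdin parses as JSON.
# Assumes python3 is on the PATH; json.tool ships with its standard library.
is_valid_json() {
    python3 -m json.tool > /dev/null 2>&1
}

# The shape of the failed body ("a" : "b" : "c") is rejected...
if ! printf '%s' '{ "file" : "content" : "abc" }' | is_valid_json; then
    echo "malformed"
fi

# ...while a single key/value pair is accepted.
if printf '%s' '{ "data" : "abc" }' | is_valid_json; then
    echo "well-formed"
fi
```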

As specified in your pipeline definition, you only need a single data field, so the following should do the trick:

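# A document URL needs a type and an id: ged_type_0 is the type from the mapping above; the id 1 is an arbitrary example.
curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00/ged_type_0/1?pipeline=attachment" -d '{
    "data" : "'$aB64content'"
}'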
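Building the request body can be sketched as follows. This is only an illustration under stated assumptions: build_payload is a hypothetical helper name, and the URL in the usage comment assumes the type ged_type_0 from the question's mapping and an arbitrary document id 1, since indexing a document in Elasticsearch 5.x addresses index/type/id rather than the index alone. The sketch strips the line wraps from the base64 output instead of escaping them, so the encoded text is a single JSON-safe line.

```shell
# build_payload: hypothetical helper that wraps a file's Base64 encoding
# in the {"data": ...} body read by the "attachment" pipeline's "field".
build_payload() {
    # tr -d '\n' removes the line wraps that base64 may insert; the Base64
    # alphabet (A-Z a-z 0-9 + / =) needs no escaping inside a JSON string.
    printf '{"data":"%s"}' "$(base64 < "$1" | tr -d '\n')"
}

# Usage (not executed here; host and paths are the ones from the question,
# the type and id in the URL are assumptions):
#   build_payload /tmp/zookeeperAdmin.pdf > /tmp/zookeeperAdmin.json
#   curl -XPUT 'http://vm01.jbdata.fr:9200/ged-idx-00/ged_type_0/1?pipeline=attachment' \
#        --data-binary @/tmp/zookeeperAdmin.json
```

Writing the body to a file and sending it with --data-binary keeps a large encoded PDF off the command line.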

Answer 2

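Actually, extracting text from a PDF properly is quite hard. Often you have to extract embedded images or render the whole page and then OCR it, depending on the text extracted from the page and its content (for example, you have to check whether the encoding is correct). You cannot make Tika apply any custom logic during parsing, and Ingest Attachment does not let you do so either. If you want high-quality PDF parsing, Ingest Attachment is not what you are looking for; you have to do it yourself.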

Read the full story here: https://blog.ambar.cloud/ingest-attachment-plugin-for-elasticsearch-should-you-use-it/

Answer 1
1 2016-12-12 02:55:43

Answer 2
0 2017-04-04 13:52:42