How do I index a PDF in Elasticsearch 6.1 using the ingest attachment plugin and the JavaScript client?

Question

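I tried to follow the instructions in the answer to this question:

How to index a pdf file in Elasticsearch 5.0.0 with the ingest-attachment plugin?

I could not find many examples for the Elasticsearch JavaScript client, so here is what I have: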

Create the index:

// elasticsearch Client
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({hosts: [ 'http://localhost:9200/']});

// Create index
client.create({index: 'pdfs', type: 'pdf', id: 'my-index-id', 
    body: {description: 'Test pdf indexing'}
})
.then(function () {console.log("Index created");})
.catch(function (error) {console.log(error);});

Define the index mapping in Node:

var body = {
    pdf:{
        properties:{
            title : {"type" : "keyword", "index" : "false"},
            type  : {"type" : "keyword", "index" : "false"},
            "attachment.pdf" : {"type" : "keyword"}
        }
    }
}

client.indices.putMapping({index:"pdfs", type:"pdf", body:body})
.then((response) => {addPipeline()})
.catch((error) => {console.log("putMapping error: " + error)})

Define the ingest pipeline on the cluster using the PUT pipeline API:

function addPipeline(){
  client.ingest.putPipeline({
    id: 'my-pipeline-id',
    body: {
      "description" : "parse pdfs and index into ES",
      "processors" : [
        { "attachment" : { "field" : "pdf", "indexed_chars" : -1 } },
        { "remove" : { "field" : "pdf" } }
      ]
    }
  })
  .then(function () {
     console.log("putPipeline Resolved");
   })
  .catch(function (error) {
     console.log("putPipeline error: " + error);
   });
};

Before trying to upload a PDF, I checked that the index had been created:

curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Result:

health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .kibana EaUbEQCETVKQbYThrhPGaA   1   1          1            0      3.6kb          3.6kb
yellow open   pdfs    Z2SR-ApFR9SYsvY08tgSZw   5   1          1            0      4.6kb          4.6kb

When I try to index a PDF with the following command, I get an error.

curl -H 'Content-Type: application/pdf' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d'
{
    "pdf": @/Users/user/path/to/pdf/file.pdf
}'

Error:

{"error":"Content-Type header [application/pdf] is not supported","status":406}

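Is this because my PDF is not Base64-encoded, or am I doing something else wrong? I am trying to build a digital library for searching PDFs.

Update:

I encoded my PDF with the following command:

openssl base64 -in /Users/user/path/to/pdf/file.pdf -out base64_encoded_file

I then recreated my index and ran the following command on base64_encoded_file: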

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d @/base64_encoded_file

I got the following error:

Warning: Couldn't read data from file "/base64_encoded_file", this makes an empty POST.
{"error":{"root_cause":[{"type":"parse_exception","reason":"request body is required"}],"type":"parse_exception","reason":"request body is required"},"status":400}

I then tried passing the file inside the request body:

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
        {
          "pdf" : @/base64_encoded_file
        }'

Error:

{"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to parse content to map"}],"type":"parse_exception","reason":"Failed to parse content to map","caused_by":{"type":"json_parse_exception","reason":"Unexpected character ('@' (code 64)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6db5a3dc; line: 3, column: 16]"}},"status":400}

Answer 1

I found the answer to my question:

Elasticsearch does not read data from a file path given in the document source, so

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
        {
          "pdf" : @/base64_encoded_file
        }'

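will not work. The "field" in the attachment processor options ("pdf" in my example) must contain the data itself, not a file path. This thread explains three options for sending [pdf] content to Elasticsearch:

You can extract the content [from the pdf] yourself and send Elasticsearch only what you want to index.
You can send the binary as Base64 to an Elasticsearch ingest pipeline, which will do the extraction.
You can send the binary file to FSCrawler, which will do the extraction before sending the result to Elasticsearch.

In short, the data passed to Elasticsearch must be in the form the documentation defines: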

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
    {
        "pdf" : "base64_encoded_data"
    }'
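
Since the question asked about the JavaScript client, here is a minimal sketch of the same request from Node. It assumes `client` is an instance created with the legacy `elasticsearch` npm package as in the question, and that this client version accepts a `pipeline` parameter on `index`; the helper names `buildAttachmentDoc` and `indexPdf` are made up for illustration. Encoding with `Buffer` also avoids a pitfall of the `openssl base64` command used above, which wraps its output across multiple lines by default, and raw line breaks are not valid inside a JSON string.

```javascript
// Sketch only: Base64-encode a PDF and index it through the ingest pipeline.
const fs = require('fs');

// Build the document body. The "pdf" field must hold the Base64 data itself,
// not a file path. Buffer#toString('base64') produces a single unwrapped line.
function buildAttachmentDoc(pdfBytes) {
  return { pdf: Buffer.from(pdfBytes).toString('base64') };
}

// Index one PDF. `client` is assumed to be an elasticsearch.Client instance;
// the index, type, and pipeline names mirror the curl commands above.
function indexPdf(client, filePath, id) {
  const body = buildAttachmentDoc(fs.readFileSync(filePath));
  return client.index({
    index: 'my_index',
    type: 'my_type',
    id: id,
    pipeline: 'my-pipeline-id', // assumed to be supported by this client version
    body: body
  });
}
```

With the client from the question, `indexPdf(client, '/Users/user/path/to/pdf/file.pdf', 'id')` should return the promise from `client.index`, which can be handled with `.then` and `.catch` like the other calls.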
