How to index a PDF in Elasticsearch 6.1 with ingest-attachment plugin & JavaScript Client?

I tried following the instructions in the answer to the following question:

How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?

I couldn't find many examples of the JavaScript client for Elasticsearch, so here is what I have:

Create index

// elasticsearch Client
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({hosts: [ 'http://localhost:9200/']});

// Create index
client.create({index: 'pdfs', type: 'pdf', id: 'my-index-id', 
    body: {description: 'Test pdf indexing'}
})
.then(function () {console.log("Index created");})
.catch(function (error) {console.log(error);});

Define the index mapping in Node:

var body = {
    pdf:{
        properties:{
            title : {"type" : "keyword", "index" : "false"},
            type  : {"type" : "keyword", "index" : "false"},
            "attachment.pdf" : {"type" : "keyword"}
        }
    }
}

client.indices.putMapping({index:"pdfs", type:"pdf", body:body})
.then((response) => {addPipeline()})
.catch((error) => {console.log("putMapping error: " + error)})

Define the ingest pipeline in the cluster with the PUT API

function addPipeline(){
  client.ingest.putPipeline({
    id: 'my-pipeline-id',
    body: {
      "description" : "parse pdfs and index into ES",
      "processors" : [
        { "attachment" : { "field" : "pdf", "indexed_chars" : -1 } },
        { "remove" : { "field" : "pdf" } }
      ]
    }
  })
  .then(function () {
     console.log("putPipeline Resolved");
   })
  .catch(function (error) {
     console.log("putPipeline error: " + error);
   });
};

Before trying to upload a PDF, I checked that the index had been created:

curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Result:

health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .kibana EaUbEQCETVKQbYThrhPGaA   1   1          1            0      3.6kb          3.6kb
yellow open   pdfs    Z2SR-ApFR9SYsvY08tgSZw   5   1          1            0      4.6kb          4.6kb

When I try to index the PDF with the following command, I get an error.

curl -H 'Content-Type: application/pdf' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d'
{
    "pdf": @/Users/user/path/to/pdf/file.pdf
}'

Error:

{"error":"Content-Type header [application/pdf] is not supported","status":406}

Is this because my PDF is not Base64 encoded, or am I doing something else wrong? I am trying to create a digital library to search through PDFs.

UPDATE:

I encoded my PDF with:

openssl base64 -in /Users/user/path/to/pdf/file.pdf -out base64_encoded_file

I recreated my index and ran the following command on base64_encoded_file:

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d @/base64_encoded_file

And I got the following error:

Warning: Couldn't read data from file "/base64_encoded_file", this makes an empty POST.
{"error":{"root_cause":[{"type":"parse_exception","reason":"request body is required"}],"type":"parse_exception","reason":"request body is required"},"status":400}

I then tried passing the file inside the request body:

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
        {
          "pdf" : @/base64_encoded_file
        }'

Error:

{"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to parse content to map"}],"type":"parse_exception","reason":"Failed to parse content to map","caused_by":{"type":"json_parse_exception","reason":"Unexpected character ('@' (code 64)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6db5a3dc; line: 3, column: 16]"}},"status":400}

Help!

I found the answer to my problem:

Elasticsearch does not fetch the data from a file path for you, and curl only expands @file when it is the whole -d argument, not when it appears inside an inline body. So

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
        {
          "pdf" : @/base64_encoded_file
        }'

won't work. The field named by the attachment processor's "field" option (in my example, "pdf") must contain the data itself, not a file path. This thread explains three options for sending PDF content to Elasticsearch:

  1. You can extract the content from the PDF yourself and send only what you want to index to Elasticsearch.
  2. You can send the Base64-encoded binary to Elasticsearch ingest, which will do the extraction.
  3. You can send the binary to FSCrawler, which will do the extraction before sending to Elasticsearch.
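
For option 2, the Base64 text has to end up inside a JSON string value. Here is a minimal Node sketch of that step, assuming the pipeline reads a field called "pdf" as in the question. The helper names buildAttachmentBody and writePayloadFile are hypothetical, chosen only for illustration. One detail to watch: openssl base64 wraps its output across lines unless -A is passed, and raw line breaks are not valid inside a JSON string, whereas Buffer's Base64 output is a single line.

```javascript
const fs = require('fs');

// Hypothetical helper: turn raw PDF bytes into the JSON body expected by a
// pipeline whose attachment processor reads the "pdf" field.
function buildAttachmentBody(pdfBytes) {
  // Buffer#toString('base64') emits one line with no line breaks,
  // so the value can be embedded in a JSON string as-is.
  return { pdf: Buffer.from(pdfBytes).toString('base64') };
}

// Hypothetical helper: write the body to a JSON file. Because the whole
// file is then valid JSON, curl can send it with -d @payload.json.
function writePayloadFile(pdfPath, payloadPath) {
  const body = buildAttachmentBody(fs.readFileSync(pdfPath));
  fs.writeFileSync(payloadPath, JSON.stringify(body));
  return body;
}
```

With a payload file written this way, the curl call from the question should only need -d @payload.json (the complete file, not an @ reference inside a quoted body).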

In short, the data passed to Elasticsearch must be in the form defined in the documentation: a JSON document whose field holds the Base64-encoded content.

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/my_index/my_type/id?pipeline=my-pipeline-id' -d '
    {
        "pdf" : "base64_encoded_data"
    }'
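
Since the question uses the JavaScript client, the same request can probably be made from Node as well. The following is a sketch, not a tested recipe: it assumes client is an instance of the legacy elasticsearch package shown in the question and that its index() method accepts a pipeline parameter in the version you have installed. The function name indexPdf is hypothetical; the index, type, and pipeline names are the ones from the curl commands above.

```javascript
const fs = require('fs');

// Sketch: index one PDF through the ingest pipeline.
// `client` is assumed to be a legacy `elasticsearch` Client instance.
function indexPdf(client, pdfPath, id) {
  // Read the raw bytes and Base64-encode them; the attachment processor
  // expects the encoded data itself, not a path to the file.
  const data = fs.readFileSync(pdfPath).toString('base64');

  return client.index({
    index: 'my_index',
    type: 'my_type',
    id: id,
    pipeline: 'my-pipeline-id', // same role as ?pipeline=... in the curl call
    body: { pdf: data }         // "pdf" matches the processor's "field" option
  });
}

// Usage (assumes a running cluster and the pipeline defined earlier):
// indexPdf(client, '/Users/user/path/to/pdf/file.pdf', 'id')
//   .then(function (res) { console.log(res); })
//   .catch(function (err) { console.log(err); });
```

With the pipeline from the question, the remove processor drops the "pdf" field after extraction, so the stored document should contain the extracted text under the "attachment" field rather than the Base64 data.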
