How to index a PDF file in Elasticsearch 5.0.0 with the ingest-attachment plugin?

I'm new to Elasticsearch and I read at https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecated in Elasticsearch 5.0.0.

I am now trying to index a PDF file with the new ingest-attachment plugin and upload the attachment.

What I've tried so far is:

curl -H 'Content-Type: application/pdf' -XPOST localhost:9200/test/1 -d @/cygdrive/c/test/test.pdf

but I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I would expect the PDF file to be indexed and uploaded. What am I doing wrong?

I also tested Elasticsearch 2.3.3, but the mapper-attachments plugin is not valid for this version and I don't want to use any older version of Elasticsearch.

You need to make sure you have created your ingest pipeline with:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
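
Since the question uses curl rather than the Console syntax, the pipeline request above could be sent along these lines. This is only a sketch: the host localhost:9200 is taken from the question, and the function names pipeline_body and create_pipeline are made up for illustration.

```shell
#!/bin/sh
# Print the pipeline definition shown above as a JSON document.
pipeline_body() {
  cat <<'EOF'
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "data", "indexed_chars": -1 } }
  ]
}
EOF
}

# Assumption: Elasticsearch listens on localhost:9200, as in the question.
# Sends the definition to the _ingest/pipeline endpoint under the name "attachment".
create_pipeline() {
  pipeline_body | curl -s -H 'Content-Type: application/json' \
    -XPUT 'localhost:9200/_ingest/pipeline/attachment' --data-binary @-
}
```

Calling create_pipeline once should be enough; the pipeline is stored in the cluster and can be reused for later index requests.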

Then you can index a document with a PUT (not a POST), using the pipeline you've created:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

In your example, it should be something like:

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' -d '{"data": "<base64-encoded PDF content>"}'

Remember that the PDF content must be base64-encoded and sent in the "data" field of a JSON body, not as the raw file.
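
The base64 step can be sketched in shell as follows. This is a rough illustration, not a definitive recipe: build_payload and index_pdf are hypothetical helper names, and the index, type and id (test, my_type, 1) are assumptions borrowed from the examples above.

```shell
#!/bin/sh
# Print a JSON body of the form {"data":"<base64 of the file>"}.
# tr removes the line wraps that some base64 implementations insert,
# since a JSON string cannot contain raw newlines.
build_payload() {
  printf '{"data":"'
  base64 < "$1" | tr -d '\n'
  printf '"}'
}

# Assumption: Elasticsearch on localhost:9200 and a pipeline named "attachment".
# Streams the payload to curl on stdin so large PDFs never hit argument limits.
index_pdf() {
  build_payload "$1" | curl -s -H 'Content-Type: application/json' \
    -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' \
    --data-binary @-
}
```

With these defined, index_pdf /cygdrive/c/test/test.pdf would send the file from the question through the pipeline.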

Hope it will help you.

Edit 1

Please make sure to read these; they helped me a lot:

Elastic Ingest

Ingest Plugin

Ingest Presentation

Edit 2

Also, you must have the ingest-attachment plugin installed:

./bin/elasticsearch-plugin install ingest-attachment
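
To confirm the installation, the plugin list can be checked. A small sketch, assuming it is run from the Elasticsearch home directory; the helper names has_ingest_attachment and check_plugin are invented for this example.

```shell
#!/bin/sh
# Succeeds if the plugin list read from stdin mentions ingest-attachment.
has_ingest_attachment() {
  grep -q 'ingest-attachment'
}

# Assumption: the current directory is the Elasticsearch home directory.
# "elasticsearch-plugin list" prints the installed plugins.
check_plugin() {
  ./bin/elasticsearch-plugin list | has_ingest_attachment
}
```

For example, check_plugin && echo "ingest-attachment is installed" reports success only when the plugin shows up in the list. The node has to be restarted after installing a plugin before it becomes available.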

Edit 3

Before you create your ingest processor (attachment), please create your index and its mapping with the fields you will use, and make sure the mapping includes the data field (the same name as the "field" in your attachment processor), so that ingest can process your PDF content and fill in your document with it.

I set the indexed_chars option in the ingest processor to -1, which removes the limit on extracted characters, so you can index large PDF files.

Edit 4

The mapping should be something like this:

PUT my_index
{
    "mappings" : {
        "my_type" : {
            "properties" : {
                "attachment" : {
                    "properties" : {
                        "content" : {
                            "type": "text",
                            "analyzer" : "brazilian"
                        }
                    }
                }
            }
        }
    }
}

In this case, I use the brazilian analyzer, but you can remove it or use your own.
