
ElasticSearch 5.0.0 ingest-attachment plugin issues to index PDF

See this post

My environment:

{
  "name" : "node-0",
  "cluster_name" : "ES500-JBD-0",
  "cluster_uuid" : "q_akJRkrSI-glTwT5vfH4A",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}

Index & pipeline creation (Edit 3):

curl -XPUT 'vm01.jbdata.fr:9200/_ingest/pipeline/attachment' -d '{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}'

Mapping creation with the french analyzer (Edit 4):

curl -XPUT 'vm01.jbdata.fr:9200/ged-idx-00' -d '{
  "mappings" : {
    "ged_type_0" : {
      "properties" : {
        "attachment.data" : {
          "type" : "text",
          "analyzer" : "french"
        }
      }
    }
  }
}'

ES-specific config (Edit 1 & Edit 2):

$ bin/elasticsearch-plugin list
ingest-attachment

From config/elasticsearch.yml:

plugin.mandatory: ingest-attachment

Commands to index a PDF:

1/ A "raw" PDF.

curl -H 'Content-Type: application/pdf' -XPUT vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment -d @/tmp/zookeeperAdmin.pdf

{"error":{"root_cause":[{"type":"settings_exception","reason":"Failed to load settings from [%PDF-1.4% ... 0D33957F>]>>startxref76764%%EOF; line: 1, column: 2]"}},"status":500}

2/ A "B64ed" PDF.

aPath='/tmp/zookeeperAdmin.pdf'
aB64content=$(base64 $aPath | perl -pe 's/\n/\\n/g')
echo $aB64content > /tmp/zookeeperAdmin.pdf.b64
curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "file" : "content" : "'$aB64content'"
}'

{"error":{"root_cause":... "reason":"failed to parse source for create index","caused_by":{"type":"json_parse_exception","reason":"Unexpected character (':' (code 58)): was expecting comma to separate Object entries\\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@65a254b6; line: 2, column: 25]"}},"status":400}

How do I correctly use the ingest-attachment plugin to index a PDF?

In my experience, the file needs to be encoded in Base64, so your option 2 is the right way to go.

About your last attempt:

curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "file" : "content" : "'$aB64content'"
}'

The provided JSON is malformed (it has the shape "a" : "b" : "c"), hence the error.

As specified in your pipeline definition, you only need a data field, so the following should do the trick:

curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "data" : "'$aB64content'"
}'
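
Two details in the command above may still be worth checking; what follows is a sketch under stated assumptions, not a verified fix. First, a PUT to the bare index URL is handled as an index-creation request (the "failed to parse source for create index" message in the question points that way), so a document is normally sent to index/type/id. Second, the Base64 string is safer without embedded line breaks. The host, index, and type names come from the question; the document ID `1`, the variable names, and the functions `build_payload` and `index_pdf` are hypothetical and introduced only for illustration.

```shell
#!/bin/sh
# Sketch: index a PDF through the "attachment" pipeline from the question.
# Assumptions: host, index and type names are copied from the question;
# the document ID "1" is an arbitrary placeholder.

ES_URL="http://vm01.jbdata.fr:9200"
INDEX="ged-idx-00"
TYPE="ged_type_0"
DOC_ID="1"

# Print a JSON body {"data":"<base64>"} for the file given as $1.
# tr removes the line wraps that base64 may insert; the Base64 alphabet
# (A-Z a-z 0-9 + / =) needs no further JSON escaping.
build_payload() {
    printf '{"data":"%s"}' "$(base64 < "$1" | tr -d '\n')"
}

# Send the payload to index/type/id. The body is read from stdin (@-),
# so a large file does not run into the command-line length limit.
# The Content-Type header is harmless on 5.0 and required from 6.0 on.
index_pdf() {
    build_payload "$1" |
        curl -s -XPUT "$ES_URL/$INDEX/$TYPE/$DOC_ID?pipeline=attachment" \
             -H 'Content-Type: application/json' --data-binary @-
}

# Usage (needs a reachable cluster):
#   index_pdf /tmp/zookeeperAdmin.pdf
```

If the request succeeds, the extracted text should appear under the attachment field of the stored document (attachment.content by default), which can be checked by fetching the document back by its ID.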

In fact, it is quite difficult to extract text from a PDF properly. Often you have to extract inline images, or render the whole page and OCR it, depending on the text extracted from the page and its content (for example, you have to analyse whether the encoding is right or not). You simply cannot tune Tika to use custom logic inside the parsing process, and you cannot do so with Ingest Attachment either. If you are aiming for good-quality PDF parsing, Ingest Attachment is not what you are looking for; you have to do it yourself.

Read the full story here: https://blog.ambar.cloud/ingest-attachment-plugin-for-elasticsearch-should-you-use-it/
