Elasticsearch 5.0.0: issues indexing a PDF with the ingest-attachment plugin

See this post

My Env:

{
  "name" : "node-0",
  "cluster_name" : "ES500-JBD-0",
  "cluster_uuid" : "q_akJRkrSI-glTwT5vfH4A",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}

Pipeline creation (Edit 3):

curl -XPUT 'vm01.jbdata.fr:9200/_ingest/pipeline/attachment' -d '{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}'

Mapping creation (Edit 4) with the french analyzer:

curl -XPUT 'vm01.jbdata.fr:9200/ged-idx-00' -d '{
  "mappings" : {
    "ged_type_0" : {
      "properties" : {
         "attachment.data" : {
            "type": "text",
            "analyzer" : "french"
            }
         }
      }
   }
}'

ES-specific config (Edit 1 & Edit 2):

$ bin/elasticsearch-plugin list
ingest-attachment

From config/elasticsearch.yml:

plugin.mandatory: ingest-attachment

Commands to index a PDF:

1/ A "raw" PDF.

curl -H 'Content-Type: application/pdf' -XPUT vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment -d @/tmp/zookeeperAdmin.pdf

{"error":{"root_cause":[{"type":"settings_exception","reason":"Failed to load settings from [%PDF-1.4% ... 0D33957F>]>>startxref76764%%EOF; line: 1, column: 2]"}},"status":500}

2/ A Base64-encoded PDF.

aPath='/tmp/zookeeperAdmin.pdf'
aB64content=$(base64 $aPath | perl -pe 's/\n/\\n/g')
echo $aB64content > /tmp/zookeeperAdmin.pdf.b64
curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "file" : "content" : "'$aB64content'"
}'

{"error":{"root_cause":... "reason":"failed to parse source for create index","caused_by":{"type":"json_parse_exception","reason":"Unexpected character (':' (code 58)): was expecting comma to separate Object entries\\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@65a254b6; line: 2, column: 25]"}},"status":400}

How do I correctly use the ingest-attachment plugin to index a PDF?

In my experience, the file needs to be encoded in Base64, so your option 2 is the right way to go.

About your last attempt:

curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00?pipeline=attachment" -d '{
    "file" : "content" : "'$aB64content'"
}'

The provided JSON is malformed ("a" : "b" : "c"), hence the error.
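
A quick way to catch this kind of mistake before sending the request is to run the body through a JSON parser locally. The helper below is only a sketch: the name `is_valid_json` is hypothetical, and it assumes `python3` is available on the PATH.

```shell
# Return 0 if the argument parses as JSON, non-zero otherwise.
# Assumption: python3 is installed; its bundled json.tool module does the parsing.
is_valid_json() {
  printf '%s' "$1" | python3 -m json.tool > /dev/null 2>&1
}

# Usage:
#   is_valid_json '{"file" : "content" : "abc"}' || echo "malformed body"
```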

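As specified in your pipeline definition, the attachment processor reads from a data field, so that is the only field you need to send. Note also that a PUT on /ged-idx-00 alone targets the create-index API (hence "failed to parse source for create index"); to index a document, the URL must also include the type and a document ID (the ID 1 below is arbitrary). The following should do the trick:

curl -XPUT "http://vm01.jbdata.fr:9200/ged-idx-00/ged_type_0/1?pipeline=attachment" -d '{
    "data" : "'$aB64content'"
}'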
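
Putting the pieces together, here is a minimal sketch of the whole round trip. It reuses the host, index, type, and pipeline names from the question; the function names `build_payload` and `index_pdf` and the document ID are hypothetical. It strips the line breaks from the Base64 output instead of escaping them, and streams the body to curl over stdin so that a large PDF does not have to fit on the command line. The request URL includes the type and an ID because, in Elasticsearch 5.x, a PUT on the bare index name is a create-index request.

```shell
#!/bin/sh
# Names taken from the question; adjust to your cluster.
ES_URL="http://vm01.jbdata.fr:9200"
INDEX="ged-idx-00"
TYPE="ged_type_0"
PIPELINE="attachment"

# Print a JSON body whose "data" field holds the file as single-line Base64.
# Base64 output only contains [A-Za-z0-9+/=], so no JSON escaping is needed.
build_payload() {
  printf '{"data":"%s"}' "$(base64 < "$1" | tr -d '\n')"
}

# Index one file through the ingest pipeline.
# $1 = path to the file, $2 = document ID (arbitrary, chosen by the caller).
index_pdf() {
  build_payload "$1" |
    curl -s -XPUT "$ES_URL/$INDEX/$TYPE/$2?pipeline=$PIPELINE" \
         -H 'Content-Type: application/json' --data-binary @-
}

# Usage (requires a reachable cluster with the pipeline already created):
#   index_pdf /tmp/zookeeperAdmin.pdf 1
#   curl -s "$ES_URL/$INDEX/$TYPE/1?pretty"
```

With the pipeline as defined in the question (no `target_field` set), the extracted text should appear under `attachment.content` in the stored document, which is the field a language analyzer would need to be mapped on.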

In practice, extracting text from PDFs properly is quite difficult: you often have to extract inline images, or render the whole page and OCR it, depending on the text extracted from the page and its content (for example, you have to check whether the encoding is right). You cannot plug custom logic into Tika's parsing process, nor can you do so with Ingest Attachment. If you are aiming for high-quality PDF parsing, Ingest Attachment is not what you are looking for; you have to do it yourself.

Read the full story here: https://blog.ambar.cloud/ingest-attachment-plugin-for-elasticsearch-should-you-use-it/
