Converting a PDF file to Base64 to index into Elasticsearch

Question

I need to index PDFs into Elasticsearch. For that, I need to convert the files to Base64. I will be using the Attachment Mapping.

I used the following Python code to convert the file to a Base64-encoded string:

from elasticsearch import Elasticsearch
import base64
import constants

def index_pdf(pdf_filename):
    encoded = ""
    with open(pdf_filename) as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded

if __name__ == "__main__":
    encoded_pdf = index_pdf("Test.pdf")
    INDEX_DSL = {
        "pdf_id": "1",
        "text": encoded_pdf
    }
    constants.ES_CLIENT.index(
            index=constants.INDEX_NAME,
            doc_type=constants.TYPE_NAME,
            body=INDEX_DSL,
            id="1"
    )

Creating the index and indexing the document both work fine. The only issue is that I don't think the file has been encoded correctly. I tried encoding the same file with online tools and got a completely different encoding, which is larger than the one I get from Python.

Here is the PDF file.

I tried querying the text data as suggested in the plugin's documentation.

GET index_pdf/pdf/_search
{
  "query": {
    "match": {
      "text": "piece text"
    }
  }
}

It gives me zero hits. How should I go about it?

Answer 1

The encoding snippet is incorrect: it opens the PDF file in "text" mode.

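Depending on the file size, you could simply open the file in binary mode and Base64-encode its contents in one call. Example (written for Python 3; the .read().encode("base64") form only works on Python 2):

import base64

def pdf_encode(pdf_filename):
    with open(pdf_filename, "rb") as f:  # "rb": read raw bytes, not text
        return base64.b64encode(f.read()).decode("ascii")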
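Rather than comparing against online tools, one way to check an encoding is a round trip: decode the Base64 text and compare it with the file's raw bytes. The following is only a sketch for Python 3; encode_file and round_trips are hypothetical helper names, and the temporary file holds stand-in bytes rather than a real PDF.

```python
import base64
import os
import tempfile


def encode_file(path):
    """Return the Base64 text for the raw bytes of the file at `path`."""
    with open(path, "rb") as f:  # binary mode: no newline or charset translation
        return base64.b64encode(f.read()).decode("ascii")


def round_trips(path):
    """Return True if decoding the Base64 text gives back the file's exact bytes."""
    with open(path, "rb") as f:
        original = f.read()
    return base64.b64decode(encode_file(path)) == original


if __name__ == "__main__":
    # Stand-in bytes (not a real PDF) containing CRLF, a NUL and a high byte,
    # which are the kinds of bytes that text-mode reads tend to mangle.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(b"%PDF-1.4\r\n\x00\xff binary payload\n%%EOF")
    print(round_trips(tmp.name))
    os.remove(tmp.name)
```

The same comparison can be run against any encoder. Note that the loop in the question calls f.readline() after f.readlines() has already consumed the file, so each call returns an empty string; decoding that output would not reproduce the original bytes.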

If the file is large, you may have to break the encoding into chunks. I did not look into whether there is a module that does this, but it could be as simple as the example below:

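import base64


def chunk_24_read(pdf_filename):
    # Read 3 bytes (24 bits) at a time: each 3-byte group encodes to
    # exactly 4 Base64 characters, so chunks can be encoded independently.
    with open(pdf_filename, "rb") as f:
        chunk = f.read(3)
        while chunk:
            yield chunk
            chunk = f.read(3)


def pdf_encode(pdf_filename):
    encoded = ""
    length = 0  # characters written on the current output line
    for data in chunk_24_read(pdf_filename):
        # .decode("ascii") is needed on Python 3, where b64encode returns bytes
        for char in base64.b64encode(data).decode("ascii"):
            if length and length % 76 == 0:
                # Start a new line after every 76 characters
                encoded += "\n"
                length = 0
            encoded += char
            length += 1
    return encoded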
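Reading three bytes at a time works because padding can only appear when a chunk's length is not a multiple of 3, so any chunk size that is a multiple of 3 can be encoded piece by piece and concatenated. Below is a rough sketch of that idea for Python 3 with a larger chunk size and no 76-column line breaks; iter_base64, encode_stream and CHUNK_SIZE are hypothetical names, and the stream is assumed to be a buffered binary file object whose read(n) returns n bytes until end of file.

```python
import base64
import io

CHUNK_SIZE = 3 * 1024  # any multiple of 3 keeps "=" padding out of the middle


def iter_base64(stream, chunk_size=CHUNK_SIZE):
    """Yield Base64 text pieces for a buffered binary stream, chunk by chunk."""
    if chunk_size <= 0 or chunk_size % 3:
        raise ValueError("chunk_size must be a positive multiple of 3")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield base64.b64encode(chunk).decode("ascii")


def encode_stream(stream, chunk_size=CHUNK_SIZE):
    """Join the pieces into a single Base64 string without line breaks."""
    return "".join(iter_base64(stream, chunk_size))


if __name__ == "__main__":
    payload = bytes(range(256)) * 50  # stand-in data, not a real PDF
    chunked = encode_stream(io.BytesIO(payload), chunk_size=300)
    # Compare with encoding the whole payload in one call
    print(chunked == base64.b64encode(payload).decode("ascii"))
```

For a file on disk, the stream would be open(pdf_filename, "rb"). Joining the pieces with "".join also avoids the repeated string concatenation of the character-by-character loop.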

solution1
3 ACCEPTED 2015-07-09 18:09:33
