
Converting a PDF file to Base64 to index into Elasticsearch

I need to index PDFs into Elasticsearch. For that, I need to convert the files to Base64. I will be using the Attachment Mapping.

I used the following Python code to convert the file to a Base64-encoded string:

from elasticsearch import Elasticsearch
import base64
import constants

def index_pdf(pdf_filename):
    encoded = ""
    with open(pdf_filename) as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded

if __name__ == "__main__":
    encoded_pdf = index_pdf("Test.pdf")
    INDEX_DSL = {
        "pdf_id": "1",
        "text": encoded_pdf
    }
    constants.ES_CLIENT.index(
            index=constants.INDEX_NAME,
            doc_type=constants.TYPE_NAME,
            body=INDEX_DSL,
            id="1"
    )

Creating the index and indexing the document both work fine. The only issue is that I don't think the file has been encoded correctly. I tried encoding the same file with online tools and got a completely different result, which is larger than the one I get from Python.

Here is the PDF file.

I tried querying the text data as suggested in the plugin's documentation.

GET index_pdf/pdf/_search
{
  "query": {
    "match": {
      "text": "piece text"
    }
  }
}

It gives me zero hits. How should I go about it?

The encoding snippet is incorrect: it opens the PDF file in "text" mode.
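Apart from the file mode, two other things in the question's loop are worth noting. After f.readlines() has consumed the file, each f.readline() call returns an empty string, so nothing is actually encoded. And even with that fixed, Base64-encoding a file line by line and concatenating the results generally does not give the same string as encoding the whole file, because every piece is padded on its own. The sketch below illustrates the second point on a few stand-in bytes (not a real PDF):

```python
import base64

# Stand-in payload for illustration only; not a real PDF.
data = b"%PDF-1.4\nexample\nbytes\n"

# Encode the whole payload in one call.
whole = base64.b64encode(data)

# Encode each line separately and concatenate, as the question's loop intends.
piecewise = b"".join(
    base64.b64encode(line) for line in data.splitlines(keepends=True)
)

# The two results differ because each line is padded independently.
print(whole == piecewise)
```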

Depending on the file size, you could simply open the file in binary mode and Base64-encode its contents in one go. Example:

def pdf_encode(pdf_filename):
    # Read the raw bytes in binary mode and encode them in a single call.
    with open(pdf_filename, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
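One way to check whether an encoding is plausible, before comparing it with an online tool, is to decode it again and compare against the original bytes, and to check its length: unwrapped Base64 output is always 4 characters per 3 input bytes, rounded up. The following is a minimal sketch; the helper names encode_file, expected_b64_length, and roundtrip_ok are hypothetical and only serve the illustration.

```python
import base64


def encode_file(path):
    # Read the file as raw bytes and return its Base64 form as text.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def expected_b64_length(num_bytes):
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((num_bytes + 2) // 3)


def roundtrip_ok(path):
    # Decode the encoded text and compare it with the original bytes.
    with open(path, "rb") as f:
        original = f.read()
    encoded = encode_file(path)
    same_bytes = base64.b64decode(encoded) == original
    same_length = len(encoded) == expected_b64_length(len(original))
    return same_bytes and same_length
```

Calling roundtrip_ok("Test.pdf") should return True for a correctly encoded file. An encoded string much shorter than expected_b64_length of the file size suggests that part of the file was never read.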

Or, if the file is large, you may have to break the encoding into chunks. I did not check whether a module already does this, but it could be as simple as the example below:

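import base64


def chunk_24_read(pdf_filename):
    # Yield the file 3 bytes (24 bits) at a time. Three bytes encode to
    # exactly four Base64 characters, so padding only appears at the end.
    with open(pdf_filename, "rb") as f:
        chunk = f.read(3)
        while chunk:
            yield chunk
            chunk = f.read(3)


def pdf_encode(pdf_filename):
    encoded = ""
    length = 0
    for data in chunk_24_read(pdf_filename):
        # Decode to str so single characters can be appended.
        for char in base64.b64encode(data).decode("ascii"):
            # Start a new line after every 76 characters.
            if length and length % 76 == 0:
                encoded += "\n"
                length = 0

            encoded += char
            length += 1
    return encoded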
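The chunked approach works because each chunk is a multiple of 3 bytes, so the encoded pieces join up without padding in the middle. The same idea can be sketched with a larger chunk size and without the per-character loop. This is only an illustration under stated assumptions: the name pdf_encode_chunked and the chunk size are hypothetical, and unlike the code above it emits no line breaks.

```python
import base64

# Hypothetical chunk size; any positive multiple of 3 keeps the pieces aligned.
CHUNK_SIZE = 3 * 1024


def pdf_encode_chunked(pdf_filename, chunk_size=CHUNK_SIZE):
    if chunk_size <= 0 or chunk_size % 3:
        raise ValueError("chunk_size must be a positive multiple of 3")
    pieces = []
    with open(pdf_filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Only the final chunk can be shorter, so only it gets padding.
            pieces.append(base64.b64encode(chunk).decode("ascii"))
    return "".join(pieces)
```

The result should be identical to encoding the whole file in a single base64.b64encode call, which makes it easy to compare against the simpler version.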
