
Is it possible to extract a pdf with its white spaces in Python?

I have been trying to extract the text of a PDF with Python, after a tool was built to do the same thing in Java with PDFBox.

The Java implementation works on this PDF, but I have been struggling to reproduce it in Python: neither pdfminer, pypdf, nor pypdf2 has been able to extract the text line by line with its spaces intact. In particular, pdfminer's pdf2txt splits the page into 3 columns for some reason and then reads each column line by line.

The closest I have come is the implementation from a Stack Overflow question, which unfortunately does not keep the spaces. Since my variables both contain numbers, I am unable to tell them apart again in the extracted text.

Given this, is it possible to extract a PDF's text line by line in Python with its whitespace preserved?

The following works in my case:

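import os

from pdf2image import convert_from_path
import pytesseract

# Render each PDF page to an image (pdf2image needs poppler installed).
images = convert_from_path("sample.pdf")

# Create the output directory first; image.save() fails if it is missing.
os.makedirs("./images", exist_ok=True)
for i, image in enumerate(images, start=1):
    image.save(f"./images/page_{i}.jpg", "JPEG")

# OCR the first page and print the recognized text.
print(pytesseract.image_to_string("./images/page_1.jpg"))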

The idea here is to first convert each PDF page to an image and then read the text from it with OCR. In my case this approach preserves the whitespace.

Dependencies:

  • conda install -c conda-forge tesseract
  • conda install pdf2image
  • conda install pytesseract
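If the plain OCR string still collapses wide gaps between columns, one option is to rebuild each line from word positions instead. The following is only a minimal sketch under assumptions: it expects (text, left, top) tuples, which could be assembled from the 'text', 'left' and 'top' lists that pytesseract.image_to_data returns with output_type=Output.DICT. The helper name layout_lines and the char_width and line_height defaults are hypothetical and need tuning per document.

```python
from collections import defaultdict


def layout_lines(words, char_width=10.0, line_height=20.0):
    """Rebuild text lines from positioned words, padding gaps with spaces.

    words: iterable of (text, left, top) tuples in pixel units.
    char_width, line_height: assumed average glyph size (hypothetical
    defaults; adjust them to the rendered page).
    """
    rows = defaultdict(list)
    for text, left, top in words:
        if text.strip():  # skip empty entries (e.g. non-word OCR rows)
            # Words whose tops fall in the same height bucket share a line.
            rows[round(top / line_height)].append((left, text))

    lines = []
    for row in sorted(rows):
        line = ""
        for left, text in sorted(rows[row]):
            column = int(left / char_width)
            # Pad up to the word's column; keep at least one space between words.
            pad = max(column - len(line), 1 if line else 0)
            line += " " * pad + text
        lines.append(line)
    return lines


# Made-up sample positions, purely for illustration.
sample = [("Total", 0, 100), ("42", 200, 102), ("Name", 0, 60), ("Value", 200, 61)]
for line in layout_lines(sample):
    print(line)
```

Bucketing by a fixed line height is a simplification: a line whose words straddle a bucket boundary can be split in two, so a real document may need a smarter grouping step.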

You can use Aspose.PDF Cloud SDK for Python to extract text from a PDF line by line, along with its whitespace. Currently, it supports processing files from cloud storage (Amazon S3, Dropbox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage, FTP Storage, and Aspose default Cloud Storage).

Here is sample code:

import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi

# Get Client Id and Client Secret from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxx',
    app_sid='xxxx-xxxx-xxxx-xxxx-xxxxxxxxxx')

pdf_api = PdfApi(pdf_api_client)
temp_folder = "Temp"

# Upload the PDF file to cloud storage
data_file = "C:/Temp/02_pages.pdf"
remote_name = "02_pages.pdf"
pdf_api.upload_file(temp_folder + '/' + remote_name, data_file)

llx = 0
lly = 0
urx = 0
ury = 0

response = pdf_api.get_text(remote_name, llx, lly, urx, ury, folder=temp_folder)

# Print each extracted text occurrence on its own line
for occurrence in response.text_occurrences.list:
    print(occurrence.text)

PS: I'm a developer evangelist at Aspose


 