
PDF - Split Single Words into Individual Lines - Python 3

I am trying to extract words from a PDF into individual lines, but can only do this with Text files as demonstrated below.

Moreover, the rule is that I cannot convert PDF files to TXT then perform this operation. It must be done on PDF files.

with open('filename.txt','r') as f:
    for line in f:
        for word in line.split():
            print(word)

If filename.txt contains just "Hello World!", then this code prints:

Hello
World!

I need to do the same with searchable PDF files as well. Any help would be appreciated.

For the PDF, you should use pdfminer (the maintained Python 3 fork is pdfminer.six) or PyPDF2.
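As a rough sketch of that approach, assuming the pdfminer.six package is installed: its high-level `extract_text` function reads the PDF directly, so no intermediate TXT file is written. The helper names `split_into_words` and `print_pdf_words_pdfminer` and the path `sample.pdf` are made up for illustration.

```python
def split_into_words(text):
    """Return every whitespace-separated word in text, in order."""
    return text.split()


def print_pdf_words_pdfminer(pdf_path):
    """Print each word of a searchable PDF on its own line."""
    # Imported here so split_into_words works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    # extract_text reads the PDF itself and returns one string for the whole file.
    for word in split_into_words(extract_text(pdf_path)):
        print(word)


# Usage (hypothetical path):
# print_pdf_words_pdfminer("sample.pdf")
```

The word splitting is the same `str.split()` call as in the question; only the source of the text changes.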

Here is a good article you can use to extract the text, and then you can use Anilkumar's method to extract line by line.


Check out PyMuPDF. There's loads of stuff you can do, including getting line-by-line text from a PDF using page.getText() (spelled page.get_text() in recent releases).
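A minimal sketch of the PyMuPDF route, assuming a recent release where the method is spelled `page.get_text()`. With the `"words"` option it returns one tuple per word, with the word text at index 4. The names `words_from_tuples` and `print_pdf_words_pymupdf` are illustrative and not part of PyMuPDF.

```python
def words_from_tuples(word_tuples):
    """Pick the text out of PyMuPDF word tuples.

    Each tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    return [entry[4] for entry in word_tuples]


def print_pdf_words_pymupdf(pdf_path):
    """Print every word of the PDF at pdf_path on its own line."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for word in words_from_tuples(page.get_text("words")):
                print(word)


# Usage (hypothetical path):
# print_pdf_words_pymupdf("sample.pdf")
```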

You can use pdfreader to extract text (both plain strings and text markup containing PDF operators) from a PDF document.

Here is sample code that extracts both from every page of the document.

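from pdfreader import SimplePDFViewer, PageDoesNotExist

# you_pdf_file_name is a placeholder for the path to your PDF
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""

try:
    while True:
        # render the current page so that viewer.canvas is populated
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        # move to the next page; raises PageDoesNotExist after the last one
        viewer.next()
except PageDoesNotExist:
    pass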

Just want to point out that text in PDFs usually does not come as "words". It comes as commands telling a conforming PDF viewer where and how to place each glyph, which means a single word may be drawn by several commands. Read more on that in the PDF 1.7 docs, sec. 9 - Text.
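To illustrate why that matters: if the text arrives as one fragment per drawing command, splitting each fragment on its own breaks words apart, while joining the fragments first and then splitting keeps them whole. The fragment list below is invented for the example. Real output from `viewer.canvas.strings` depends on how the PDF was produced, and some files carry no explicit space characters at all.

```python
def words_from_fragments(fragments):
    """Join text fragments in order, then split the result on whitespace."""
    return "".join(fragments).split()


# Hypothetical fragments, as if "Hello World!" were drawn by four commands.
fragments = ["Hel", "lo", " Wor", "ld!"]

for word in words_from_fragments(fragments):
    print(word)
# prints "Hello" and then "World!", one per line
```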

When I saw filename.txt I got confused.

Since you are working with a PDF, the link below might be helpful. See if it helps.

How to use PDFminer.six with python 3?

