Newline in text extraction from PDF

I am writing a function that extracts text from a PDF using the pyPdf library. The extraction itself works, but I have run into a couple of problems, one of which is that the output contains no newlines.

So I looked for a way to add newlines and came up with this:

pages = u""
# Iterate over pages
for i in range(0, pdf.getNumPages()):
    # Extract text from this page only, so earlier pages are not appended twice
    content = pdf.getPage(i).extractText()
    # Mark every ". " as a sentence end
    content = content.replace('. ', '. <br />')
    pages += content

# Collapse whitespace
content = " ".join(pages.replace(u"\xa0", " ").strip().split())

The problem is that even instances like this:

1. Apple

end up like this:

1.

Apple

which they shouldn't. I just want to add a newline at the end of every sentence.

Is there a way to determine where a sentence ends, or to check whether a full stop belongs to list numbering?

A hackish solution is to perform the replacement only if the full stop is not immediately preceded by a digit. Change the line content = content.replace('. ', '. <br />') to the following:

import re

# re.sub returns a new string, so assign the result back
content = re.sub(r'([^0-9])\. ', r"\1. <br />", content)
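As a small sketch of the same idea, the digit check can also be written as a negative lookbehind, which avoids the capture group and also matches a full stop at the very start of the string. The function name mark_sentence_ends and the sample text are made up for illustration; they do not come from the question. Like the pattern above, this will miss sentences that genuinely end in a digit (for example "It happened in 2020. Then ...").

```python
import re

def mark_sentence_ends(text, marker="<br />"):
    """Append marker after '. ' unless the full stop follows a digit.

    Hypothetical helper for illustration only.
    """
    # (?<!\d) is a negative lookbehind: the character before the dot
    # must not be a digit, so numbering such as "1. " is left alone.
    return re.sub(r"(?<!\d)\. ", ". " + marker, text)

sample = "1. Apple is a fruit. 2. Pears are too. Done."
result = mark_sentence_ends(sample)
print(result)
```

The numbering "1. " and "2. " is untouched, while the full stops after "fruit" and "too" get the marker. The final "Done." is not followed by a space, so it is not matched either.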

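Why not use re.sub()?

For a line that ends with a dot, possibly followed by some whitespace, the pattern should be r'\.\s*$'. Note that $ matches only at the end of the string (or just before a trailing newline) unless re.MULTILINE is passed. For example:

import re
# ...

content = re.sub(r'\.\s*$', '. <br />', content)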
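A minimal sketch of the per-line variant follows. It assumes the extracted text still contains line breaks, which may not hold for pyPdf output given that the question reports the newlines as missing. The name mark_line_ends and the sample text are hypothetical. It uses [ \t]* instead of \s* so that the match does not swallow the newline itself.

```python
import re

def mark_line_ends(text, marker="<br />"):
    """Append marker to every line that ends with a full stop.

    Hypothetical helper for illustration only.
    """
    # With re.MULTILINE, $ matches at the end of each line,
    # not just at the end of the whole string.
    return re.sub(r"\.[ \t]*$", ". " + marker, text, flags=re.MULTILINE)

text = "First line.  \nSecond line\nThird line."
marked = mark_line_ends(text)
print(marked)
```

Only the first and third lines end with a full stop, so only those receive the marker; trailing spaces after the dot are replaced along with it.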

pyPdf is great for some things, but not really good at text extraction. Have a look at the pdfminer library. Or use a tool like pdftotext.
