Newline in text extraction from PDF

Question

I am writing a function that extracts text from a PDF using the pyPdf library. The extraction itself works, but I have run into a couple of problems, one of which is that the newlines are dropped.

So I looked for a way to add the newlines back, and came up with this:

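# Iterate pages (pdf is the already opened pyPdf reader)
pages = ""
for i in range(pdf.getNumPages()):
    # Extract text from this page only, so earlier pages are not appended twice
    content = pdf.getPage(i).extractText()
    # Mark every ". " as a sentence end
    content = content.replace('. ', '. <br />')
    pages += content

# Collapse whitespace (non-breaking spaces included)
content = " ".join(pages.replace(u"\xa0", " ").strip().split())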

The problem is that even cases like this:

1. Apple

become this:

1.

Apple

which they shouldn't. I just want to add a newline at the end of every sentence.

Is there a way to determine where a sentence ends, or to check whether a full stop belongs to a list number?

Answer 1

A hackish solution is to perform the replacement only if the full stop is not immediately preceded by a digit. Change the line content = content.replace('. ', '. <br />') to the following (note that re.sub returns the new string, so the result has to be assigned back):

import re

content = re.sub(r'([^0-9])\. ', r"\1. <br />", content)
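A slightly more defensive variant of the same idea, as a sketch only: a negative lookbehind `(?<!\d)` skips full stops that directly follow a digit without consuming the preceding character, so it also works when the full stop is the very first character of the text. The function name `add_sentence_breaks` and the `marker` parameter are made up for illustration. It is still a heuristic: a sentence that ends in a number ("... in 2014. Then ...") is missed, and abbreviations such as "e.g. " still get a break.

```python
import re

# A full stop followed by a space, but not directly preceded by a digit.
_SENTENCE_END = re.compile(r"(?<!\d)\. ")


def add_sentence_breaks(text, marker="<br />"):
    """Insert `marker` after every '. ' that does not follow a digit."""
    # A function is used as the replacement so that any backslashes in
    # `marker` are inserted literally rather than treated as escapes.
    return _SENTENCE_END.sub(lambda match: ". " + marker, text)


if __name__ == "__main__":
    print(add_sentence_breaks("1. Apple is red. It is sweet."))
```

Here the "1. " at the start is left alone, while the ". " after "red" gets the marker.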

Answer 2

Why not use re.sub()?

For a line that ends with a full stop, possibly followed by some whitespace, the pattern should be r"\.\s*$", i.e.:

import re
# ...

content = re.sub(r'\.\s*$', '. <br />', content)
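One thing to keep in mind with this pattern: by default `$` only matches at the very end of the string (or just before a final newline), so on text with several lines only the last one is affected. If the extracted text does contain line breaks, a sketch along these lines might be closer to what is wanted. It uses `re.MULTILINE` so that `$` matches at every line end, and `[ \t]*` instead of `\s*` so the line breaks themselves are not swallowed. The name `mark_line_ends` is hypothetical.

```python
import re

# A full stop, optional trailing spaces/tabs, then the end of a line.
# re.MULTILINE makes `$` match before every newline, not only at the
# end of the whole string.
_LINE_END_DOT = re.compile(r"\.[ \t]*$", re.MULTILINE)


def mark_line_ends(text, marker="<br />"):
    """Append `marker` to every line that ends with a full stop."""
    return _LINE_END_DOT.sub(lambda match: ". " + marker, text)


if __name__ == "__main__":
    sample = "First line.\n1. Apple\nLast line.  "
    print(mark_line_ends(sample))
```

Because "1. Apple" does not end with a full stop, it is left untouched. If the extractor returns everything on a single line, as described in the question, this approach will only ever mark the final full stop.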

Answer 3

pyPdf is great for some things, but not really good at text extraction. Have a look at the pdfminer library. Or use a tool like pdftotext.
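For the pdftotext route, a rough sketch of calling it from Python, assuming the `pdftotext` binary (from Poppler or Xpdf) is installed and on the PATH. The `-layout` flag asks it to keep the physical layout of the page, `-enc UTF-8` sets the output encoding, and `-` as the output file sends the text to stdout. The two function names are made up for illustration.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the page layout
    # "-" as the output file name means "write the text to stdout".
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def extract_text_with_pdftotext(pdf_path, keep_layout=True):
    """Run pdftotext and return the extracted text as a string.

    Raises FileNotFoundError if the binary is missing and
    subprocess.CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With line breaks preserved in the output, sentence splitting no longer has to be guessed from full stops alone.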

3 answers

solution1  2  ACCEPTED  2014-02-07 09:25:58
solution2  0  2014-02-07 08:43:50
solution3  0  2014-02-07 13:59:21
