Alternative to PyPDF2

I am extracting the text from a .pdf file using the PyPDF2 package. I am getting output, but not in the desired form, and I am unable to find where the problem is.

The code snippet is as follows:

import PyPDF2
def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('F:\\Pen Drive 8 GB\\PDF\\Handbooks\\book1.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while (startPage <= endPage):
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(3, 3)

The output I am currently getting is attached for reference:

[image: enter image description here]

Any help is highly appreciated.

The line cleanText += myWord just concatenates everything into one long string. If you want to filter out '\n', then instead of:

for myWord in text:
    if myWord != '\n':
        cleanText += myWord
text = cleanText.strip().split()

You can do this:

text = [w for w in text if w != '\n']
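To see what the loop is actually iterating over, here is a minimal sketch that uses hard-coded strings in place of the text returned by extractText(). The sample strings and the helper name words_from_pages are made up for illustration and are not part of PyPDF2. It shows that using += to add a string to a list appends individual characters, and one possible way to get a list of words instead is to split each page's string directly.

```python
# Stand-in for the strings returned by pageObj.extractText();
# these sample values are assumptions made up for illustration.
sample_pages = ["Hello\nworld", "foo bar\n"]

# What the question's code does: += on a list with a string operand
# extends the list one character at a time.
chars = []
chars += sample_pages[0]
print(chars[:5])  # ['H', 'e', 'l', 'l', 'o']


def words_from_pages(page_texts):
    """Hypothetical helper: return all whitespace-separated words."""
    words = []
    for page_text in page_texts:
        # str.split() with no arguments splits on any run of whitespace,
        # including '\n', so no separate newline filter is needed.
        words.extend(page_text.split())
    return words


print(words_from_pages(sample_pages))  # ['Hello', 'world', 'foo', 'bar']
```

In the question's function, the same idea would mean calling split() on each page's extracted text rather than accumulating characters in a list first.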
