
Search for multiple words in a PDF

I'm trying to write a Python script that finds specific words in PDF files. Right now I have to scroll through the output to find the lines where they occur.

I want only the lines containing the words to be printed or saved to a separate file.

# import packages
import PyPDF2
import re

# open the pdf file
object = PyPDF2.PdfFileReader("Filename.pdf")

# get number of pages
NumPages = object.getNumPages()

# define keyterms
Strings = "House|Property|street"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i)) 
    Text = PageObj.extractText() 
    # print(Text)
    ResSearch = re.search(Strings, Text)
    print(ResSearch)

When I run the above code, I need to scroll through the output to find the lines where the words are found. I expect only the lines containing the words to be printed or saved to a separate file, or the pages containing those lines to be saved to a separate PDF or TXT file. Thanks in advance for the help.

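You could split the text of each page into lines and test each line with re.search. Note that re.match only matches at the start of a string, so it would miss a word that appears later in the line.

As an example:

# continues from the question's code: `object` is the reader, `NumPages` the page count
for i in range(0, NumPages):
    page = object.getPage(i)
    text = page.extractText()
    for line in text.splitlines():
        # re.search finds a key term anywhere in the line
        if re.search('House|Property|street', line):
            print(line)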
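
To also save the matching lines to a separate file, as the question asks, the same idea can be wrapped in two small helpers. This is only a sketch under some assumptions: `find_matching_lines`, `save_hits`, and the output path are hypothetical names, the input is a plain list of page texts so the snippet runs without a PDF library, and the case-insensitive flag is an addition that the original pattern does not have.

```python
import re

# Same key terms as in the question; IGNORECASE is an added assumption.
PATTERN = re.compile(r"House|Property|street", re.IGNORECASE)


def find_matching_lines(page_texts, pattern=PATTERN):
    """Return (page_number, line) pairs for each line containing a key term.

    page_texts is a list of strings, one per page (hypothetical input; in the
    question each string would come from extractText() on a page).
    """
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            # search() looks anywhere in the line, not only at its start
            if pattern.search(line):
                hits.append((page_number, line.strip()))
    return hits


def save_hits(hits, out_path):
    """Write one 'page N: line' entry per hit to a plain-text file."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for page_number, line in hits:
            fh.write(f"page {page_number}: {line}\n")
```

With the question's variables, the list of page texts could be built as [object.getPage(i).extractText() for i in range(NumPages)], passed to find_matching_lines, and the result handed to save_hits together with a file name of your choice. Keeping the page number next to each line means you no longer need to scroll through the output to see where a word was found.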
