
Search multiple words from a PDF

I'm trying to write a Python script that finds specific words in PDF files. Right now I have to scroll through the output to find the lines where they occur.

I want only the lines containing the words to be printed or saved as a separate file.

# import packages
import PyPDF2
import re

# open the pdf file
object = PyPDF2.PdfFileReader("Filename.pdf")

# get number of pages
NumPages = object.getNumPages()

# define keyterms
Strings = "House|Property|street"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i))
    Text = PageObj.extractText()
    # print(Text)
    ResSearch = re.search(Strings, Text)
    print(ResSearch)

When I run the above code, I need to scroll through the output to find the lines where the words are found. I expect the lines containing the words to be printed or saved as a separate file, or the pages containing those lines to be saved in a separate PDF or txt file. Thanks in advance for the help.

You could split the text of each page into lines and test each line with re.search. (re.match would only find the terms at the very start of a line.)

As an example:

for i in range(0, NumPages):
    page = object.getPage(i)
    text = page.extractText()
    # check every line of the page, not the page as a whole
    for line in text.splitlines():
        # re.search finds a term anywhere in the line
        if re.search(Strings, line):
            print(line)
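
A slightly fuller sketch of the same line-by-line idea, which also covers the "save as a separate file" part of the question. Several things here are assumptions rather than part of the original answer: the helper names (extract_page_texts, find_matching_lines, save_matches), the output file name, the case-insensitive flag, and the use of the newer PyPDF2 names (PdfReader, reader.pages, extract_text), which need PyPDF2 2.x or later.

```python
import re

# Same key terms as in the question; IGNORECASE is an added assumption.
PATTERN = re.compile(r"House|Property|street", re.IGNORECASE)


def extract_page_texts(pdf_path):
    """Yield the extracted text of each page (assumes PyPDF2 2.x+ is installed)."""
    from PyPDF2 import PdfReader  # imported lazily so the helpers below work without it

    reader = PdfReader(pdf_path)
    for page in reader.pages:
        # extract_text() may give nothing for image-only pages
        yield page.extract_text() or ""


def find_matching_lines(page_texts, pattern=PATTERN):
    """Return (page_number, line) pairs for every line containing a key term."""
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if pattern.search(line):
                matches.append((page_number, line.strip()))
    return matches


def save_matches(matches, out_path):
    """Write one 'page N: line' entry per match to a text file."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for page_number, line in matches:
            fh.write(f"page {page_number}: {line}\n")


# Usage (hypothetical file names):
# matches = find_matching_lines(extract_page_texts("Filename.pdf"))
# save_matches(matches, "matches.txt")
```

Keeping the search separate from the PDF reading means the matching logic can be tried on plain strings first. Note that line breaks in extracted PDF text do not always correspond to the visual lines on the page.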
