
How to extract a specific text and the integer on the same line from a PDF file using Python?

Below is the data in my PDF file. Using the keyword "US stock price", I would like to extract the integer 100 from the line "US stock price 100" with Python.

****PDF FILE LINES BELOW*****

sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 
Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? 
Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur
US stock price     100
"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, 
totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. 
Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. 
Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, 
Abb price     50

Below is the code I have used for the text extraction:

import PyPDF2

path = "test.pdf"  # placeholder; the original post does not define path

with open(path, "rb") as pdf_file:
    # PyPDF2 3.x API; older releases used PdfFileReader, numPages, getPage and extractText
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        text = page.extract_text()
        print(text)

You can try using the tika package.

from tika import parser

raw = parser.from_file('test.pdf')
print(raw['content'])  # the extracted text is stored under the 'content' key
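
Once the text has been extracted, the number can be picked out by scanning it line by line. This is only a minimal sketch: find_value is a hypothetical helper name, the sample string is copied from the question rather than read from a real PDF, and it assumes the extractor keeps "US stock price     100" together on one line.

```python
def find_value(text, keyword):
    """Return the integer that follows keyword on the same line, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(keyword):
            # Everything after the keyword, split on whitespace
            rest = line[len(keyword):].split()
            if rest and rest[0].isdecimal():
                return int(rest[0])
    return None


# Sample text taken from the question; with tika this would be the extracted content
sample = "voluptas nulla pariatur\nUS stock price     100\nAbb price     50\n"
print(find_value(sample, "US stock price"))
```

If the extractor merges or splits lines differently, this line-based approach may miss the value.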

Below is code that searches for the keyword in the PDF file.

import PyPDF2
import re

# PyPDF2 3.x API; older releases used PdfFileReader, getNumPages, getPage and extractText
reader = PyPDF2.PdfReader("test.pdf")
keyword = "US stock price"
for i, page in enumerate(reader.pages):
    print("this is page " + str(i))
    txt = page.extract_text()
    # re.search returns a match object if the keyword is found, otherwise None
    res_search = re.search(re.escape(keyword), txt)
    print(res_search)
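
The search above only reports whether the keyword occurs; it does not return the number. One way to get the integer is to add a capture group to the pattern. This is a sketch under assumptions: extract_int is a hypothetical helper name, and page_text stands in for the text returned by the page extraction.

```python
import re


def extract_int(text, keyword):
    """Return the first integer that follows keyword in text, or None."""
    # re.escape protects any special characters in the keyword;
    # \s+ skips the whitespace before the number, (\d+) captures the digits
    pattern = re.escape(keyword) + r"\s+(\d+)"
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None


# Stand-in for the extracted page text, copied from the question
page_text = 'nulla pariatur\nUS stock price     100\n"Sed ut perspiciatis'
print(extract_int(page_text, "US stock price"))
```

Because \s+ also matches line breaks, this still finds the value if the extractor puts the number on the following line, which may or may not be what you want.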
