
Extract only specific values from a string with regex using Python

I am trying to extract specific values from a string using regex. The text has no spaces before some of the keywords whose values I need, and my patterns raise an error. I want to extract the values that follow each keyword.

I tried extracting the text with both PyPDF2 and pdfminer, but I get the same error.

import PyPDF2

fr = PyPDF2.PdfFileReader(file)  # file: the PDF being read
data = fr.getPage(0).extractText()

Output: ['Date : 2020-09-06 20:43:00 Ack No : 3320000266 Original for RecipientInvoice No.: IN05200125634 Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITED CIN: K253648B85PLC015063 GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....']

I want to capture Ack No, Date of Issue, and CIN from the above output.

Using the script:

regex_ack_no = re.compile(r"Ack No(\d+)")
regex_due_date = re.compile(r"Date of Issue(\S+ \d{1,2}, \d{4})")
regex_CIN = re.compile(r"CIN(\$\d+\.\d{1,2})")

ack_no = re.search(regex_ack_no, data).group(1)
due_date = re.search(regex_due_date, data).group(1)
cin = re.search(regex_CIN, data).group(1)

return [ack_no, due_date, cin]

Error:

AttributeError: 'NoneType' object has no attribute 'group'

The same script works with another PDF file whose data is laid out in a table.

You need to change the regex patterns to match the data format. The keywords are followed by optional spaces and a colon, so the patterns have to match those too. The date format is not what your pattern expects, and neither is the format of CIN.
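To see why the original patterns fail, here is a minimal sketch using a shortened copy of the text from the question. The variable names sample, original, and adjusted are illustrative and not part of the original script.

```python
import re

# Shortened copy of the extracted text from the question.
sample = "Date : 2020-09-06 20:43:00 Ack No : 3320000266 Original for Recipient"

# Original pattern: expects digits immediately after "Ack No".
original = re.search(r"Ack No(\d+)", sample)

# Adjusted pattern: allows optional whitespace and a colon in between.
adjusted = re.search(r"Ack No\s*:\s*(\d+)", sample)

print(original)           # None, so calling .group(1) raises AttributeError
print(adjusted.group(1))  # 3320000266
```

Because re.search returns None when nothing matches, the AttributeError in the question comes from the pattern not matching, not from PyPDF2 itself.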

Before calling .group(1), check that the match was successful. In my code below I return default values when there's no match.

import re

data = 'Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....'

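# Keyword, optional spaces, a colon, optional spaces, then the digits to capture.
regex_ack_no = re.compile(r"Ack No\s*:\s*(\d+)")
# The date is dot-separated, e.g. 06.09.2015.
regex_due_date = re.compile(r"Date of Issue\s*:\s*(\d\d\.\d\d\.\d{4})")
# Capture word characters lazily, stopping at the GSTIN label that follows.
regex_CIN = re.compile(r"CIN:\s*(\w+?)GSTIN:")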

ack_no = re.search(regex_ack_no, data)
if ack_no:
    ack_no = ack_no.group(1)
else:
    ack_no = 'Ack No not found'
due_date = re.search(regex_due_date, data)
if due_date:
    due_date = due_date.group(1)
else:
    due_date = 'Due date not found'
cin = re.search(regex_CIN, data)
if cin:
    cin = cin.group(1)
else:
    cin = 'CIN not found'

print([ack_no, due_date, cin])
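The repeated match-then-check steps above could also be folded into one loop. The following is only a sketch under some assumptions: PATTERNS, extract_fields, and the default argument are hypothetical names, and the CIN pattern is loosened with \s* before GSTIN because the output shown in the question has a space there while the data string in this answer does not.

```python
import re

# Hypothetical field table; the keys are illustrative names.
PATTERNS = {
    "ack_no": re.compile(r"Ack No\s*:\s*(\d+)"),
    "date_of_issue": re.compile(r"Date of Issue\s*:\s*(\d{2}\.\d{2}\.\d{4})"),
    # \s* before GSTIN tolerates text extracted with or without a space.
    "cin": re.compile(r"CIN\s*:\s*(\w+?)\s*GSTIN"),
}

def extract_fields(text, default=None):
    """Return a dict of field name -> captured value, or default if absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else default
    return result

# Shortened sample based on the text in the question.
spaced = ("Ack No : 3320000266 Original for RecipientInvoice No.: IN05200125634 "
          "Date of Issue: 06.09.2015TAX INVOICE POLO INDUSTRIES LIMITED "
          "CIN: K253648B85PLC015063 GSTIN: 3451256132uuy668803E1Z9")
unspaced = spaced.replace(" GSTIN", "GSTIN")

print(extract_fields(spaced))
print(extract_fields(unspaced))
```

Returning None (or another default) for a missing field lets the caller decide how to handle an invoice where a keyword is absent, instead of failing on the first unmatched pattern.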

