
How to extract data from a specified position across several lines with Python regex

Thanks in advance. I have some text files as input in which the data follows a pattern, and I want to extract that data. My code works for many files but fails on one layout, so I would like some help.

In the first format, I extract the data between "Total" and "Amount in word:":

Total
254285.00
45771.30
300056 30
Amount in word:

In the second format, where my code fails, I want to extract the three values before "Total:":

Original/DuplicatesTriplicate TAX INVOICE (Under Rule 46 of the Central Goods & Service Tax Rules, 2017) Page 1 of 1 KS LINGAPPA AND SON Industrial Area, Plot No 14. KSSIDC TBDam Road. Hosapete-
583201 State: Karnataka
State Code: 29
GSTIN: 29AAEFK8072G122 Phone: PAN: AAEFK8072G CIN
Invoice No: OS/20-21/5
Invoice Date: 29/08/2020
Bill To: Recepient Code: GSCPL Recepient Name: GREAT SANDS CONSULTING PRIVATE
LIMITED Address: 70, TUMKUR ROAD,YESHWANTHPUR,BANGALURU(Bangalore) Urban
Karnataka, 560022 GSTIN: 29AAECG5355M1Z3
State: Karnataka PAN: AAECG5355M
State Code: 29 Place of Supply: Karnataka
me Urbane Karnataka, 560022
Reverse Charge Applicable - N

SAC & Description
Total Tax Total Amount
Taxable Val SGST/
UTGST
Rate
SGSTI CGST UTGST Rate Amount
CGST Amount
IGST Rate
IGST Amount
11
998599 & Market Devlopment
47076.00
55549.68
8473.68
9.00
0.00
4236.84
9.00
4236.84
0.00
8473.68
55549.58
47076.00
Total:
Amount in words: Rupees Fifty Five Thousand Five Hundred Fourty Nine & Paisa
Sixty Eight Only
FO-KS LINGAS DINGAPA ASOSA
Ranil Jos
Partner
Authorised Signature

if line.strip() == "Total":
    copy = True
    continue
if line.strip() == "Total:":
    copy = True
    continue
elif line.strip() == "Amount":
    copy = False
    continue
elif copy:
    cnt = cnt + 1
    if cnt == 1:
        Taxable_Value.append(line)
    if cnt == 2:
        Total_Tax.append(line)
    if cnt == 3:
        Total_Amount.append(line)
        break

Try the regex below to match the numbers, then take the values from whichever capture group matched.

((?:[\d. ]+\s){3})?Total:?\s((?:[\d. ]+\s)*)Amount
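To make the pattern easier to read, here is a sketch of the same regex written with re.VERBOSE and one comment per part. The names PATTERN, after_sample and before_sample are made up for this illustration, and the two samples are shortened excerpts of the formats in the question.

```python
import re

# Same pattern as above, spread over several lines. In VERBOSE mode whitespace
# is ignored except inside a character class, so the space in [\d. ] still counts.
PATTERN = re.compile(r"""
    ((?:[\d. ]+\s){3})?   # group 1, optional: three number lines right before the marker
    Total:?\s             # the marker, written "Total" or "Total:", plus one whitespace
    ((?:[\d. ]+\s)*)      # group 2: zero or more number lines right after the marker
    Amount                # first word of the "Amount in word(s):" line
""", re.VERBOSE)

# First format: the numbers follow "Total".
after_sample = "Total\n254285.00\n45771.30\n300056 30\nAmount in word:\n"
# Second format: the numbers come before "Total:".
before_sample = "0.00\n8473.68\n55549.58\n47076.00\nTotal:\nAmount in words: Rupees\n"

m_after = PATTERN.search(after_sample)
m_before = PATTERN.search(before_sample)

print(repr(m_after.group(2)))   # number lines captured after "Total"
print(repr(m_before.group(1)))  # number lines captured before "Total:"
```

Only one of the two groups carries data for a given layout, which is why the code below picks whichever group is non-empty.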

Code

import re
case1="""
thanks in advance I have some text files as input in which data is in some pattern I want to extract the data from these text files. My code is working for many files but it fails at one point so I would like some help

first format where i am extracting data between Total and Words

Total
254285.00
45771.30
300056 30
Amount in word:

Second format where my code fails i want to extract 3 values before Total words

Original/DuplicatesTriplicate TAX INVOICE (Under Rule 46 of the Central Goods & Service Tax Rules, 2017) Page 1 of 1 KS LINGAPPA AND SON Industrial Area, Plot No 14. KSSIDC TBDam Road. Hosapete-
583201 State: Karnataka
State Code: 29
GSTIN: 29AAEFK8072G122 Phone: PAN: AAEFK8072G CIN
Invoice No: OS/20-21/5
Invoice Date: 29/08/2020
Bill To: Recepient Code: GSCPL Recepient Name: GREAT SANDS CONSULTING PRIVATE
LIMITED Address: 70, TUMKUR ROAD,YESHWANTHPUR,BANGALURU(Bangalore) Urban
Karnataka, 560022 GSTIN: 29AAECG5355M1Z3
State: Karnataka PAN: AAECG5355M
State Code: 29 Place of Supply: Karnataka
me Urbane Karnataka, 560022
Reverse Charge Applicable - N

SAC & Description
Total Tax Total Amount
Taxable Val SGST/
UTGST
Rate
SGSTI CGST UTGST Rate Amount
CGST Amount
IGST Rate
IGST Amount
11
998599 & Market Devlopment
47076.00
55549.68
8473.68
9.00
0.00
4236.84
9.00
4236.84
0.00
8473.68
55549.58
47076.00
Total:
Amount in words: Rupees Fifty Five Thousand Five Hundred Fourty Nine & Paisa
Sixty Eight Only
FO-K.S. LINGAS DINGAPA ASOSA
Ranil Jos
Partner
Authorised Signature
"""

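# Raw string, so the backslashes in \d and \s reach the regex engine unchanged.
output = re.findall(r"((?:[\d. ]+\s){3})?Total:?\s((?:[\d. ]+\s)*)Amount", case1)
for o in output:
    # o[0] holds the three lines before "Total:", o[1] the lines after "Total".
    print(o[0] or o[1])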

Output

254285.00
45771.30
300056 30


8473.68
55549.58
47076.00
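One of the blank lines in the output above comes from the header line "Total Tax Total Amount": the text "Total Amount" also satisfies the pattern, with both groups empty. A possible way to drop such empty matches and get the values as a list is sketched below. The function name extract_totals and the shortened sample string are assumptions made for this example, not part of the original answer.

```python
import re

PATTERN = re.compile(r"((?:[\d. ]+\s){3})?Total:?\s((?:[\d. ]+\s)*)Amount")

def extract_totals(text):
    """Return one list of value strings per non-empty match in text."""
    results = []
    for before, after in PATTERN.findall(text):
        block = before or after
        values = [line.strip() for line in block.splitlines() if line.strip()]
        if values:  # skip matches that captured no numbers, e.g. "Total Amount"
            results.append(values)
    return results

# Shortened excerpt of the input above: header line, first format, second format.
sample = (
    "Total Tax Total Amount\n"
    "Total\n254285.00\n45771.30\n300056 30\nAmount in word:\n"
    "0.00\n8473.68\n55549.58\n47076.00\nTotal:\nAmount in words: Rupees\n"
)

for values in extract_totals(sample):
    print(values)
```

The values stay as strings here because the input contains OCR artifacts such as "300056 30", and guessing the intended number is left to the caller.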

Regex Demo

If you are reading the content from a file, use the code below.

import re

with open("test.txt", "r") as f:
    case1 = f.read()

# Raw string, so the backslashes in \d and \s reach the regex engine unchanged.
output = re.findall(r"((?:[\d. ]+\s){3})?Total:?\s((?:[\d. ]+\s)*)Amount", case1)
for o in output:
    print(o[0] or o[1])
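If you would rather keep the line-by-line loop from the question, a rough alternative is to remember the last three number lines while scanning, so they are still available when "Total:" shows up with nothing after it. This is only a sketch under the assumption that the marker sits alone on its line, as in both samples; the names extract_totals_by_line, NUMBER_LINE, recent and after are invented for this example.

```python
import re
from collections import deque

# A "number line" contains only digits, dots and spaces (OCR may turn "." into " ").
NUMBER_LINE = re.compile(r"[\d. ]+")

def extract_totals_by_line(lines):
    """Collect number lines found right after "Total" or right before "Total:"."""
    results = []
    recent = deque(maxlen=3)  # last three consecutive number lines seen
    after = None              # number lines seen since a marker, or None
    for raw in lines:
        line = raw.strip()
        is_number = NUMBER_LINE.fullmatch(line) is not None
        if line in ("Total", "Total:"):
            after = []                        # marker found, start collecting
        elif after is not None and is_number:
            after.append(line)
        elif after is not None and line.startswith("Amount"):
            values = after or list(recent)    # fall back to the lines before the marker
            if values:
                results.append(values)
            after = None
            recent.clear()
        else:
            after = None                      # marker was not followed by "Amount"
            if is_number:
                recent.append(line)
            else:
                recent.clear()
    return results

sample = (
    "Total Tax Total Amount\n"
    "Total\n254285.00\n45771.30\n300056 30\nAmount in word:\n"
    "0.00\n8473.68\n55549.58\n47076.00\nTotal:\nAmount in words: Rupees\n"
)
print(extract_totals_by_line(sample.splitlines()))
```

Unlike the regex, this version ignores the "Total Tax Total Amount" header because it only treats a line as a marker when the whole line is "Total" or "Total:".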
