Thanks in advance. I have some text files as input in which the data follows a pattern, and I want to extract that data. My code works for many files but fails on one format, so I would like some help.
First format, where I extract the data between "Total" and "Amount in word:":
Total
254285.00
45771.30
300056 30
Amount in word:
Second format, where my code fails. Here I want to extract the three values that come before "Total:":
Original/DuplicatesTriplicate TAX INVOICE (Under Rule 46 of the Central Goods & Service Tax Rules, 2017) Page 1 of 1 KS LINGAPPA AND SON Industrial Area, Plot No 14. KSSIDC TBDam Road. Hosapete-
583201 State: Karnataka
State Code: 29
GSTIN: 29AAEFK8072G122 Phone: PAN: AAEFK8072G CIN
Invoice No: OS/20-21/5
Invoice Date: 29/08/2020
Bill To: Recepient Code: GSCPL Recepient Name: GREAT SANDS CONSULTING PRIVATE
LIMITED Address: 70, TUMKUR ROAD,YESHWANTHPUR,BANGALURU(Bangalore) Urban
Karnataka, 560022 GSTIN: 29AAECG5355M1Z3
State: Karnataka PAN: AAECG5355M
State Code: 29 Place of Supply: Karnataka
me Urbane Karnataka, 560022
Reverse Charge Applicable - N
SAC & Description
Total Tax Total Amount
Taxable Val SGST/
UTGST
Rate
SGSTI CGST UTGST Rate Amount
CGST Amount
IGST Rate
IGST Amount
11
998599 & Market Devlopment
47076.00
55549.68
8473.68
9.00
0.00
4236.84
9.00
4236.84
0.00
8473.68
55549.58
47076.00
Total:
Amount in words: Rupees Fifty Five Thousand Five Hundred Fourty Nine & Paisa
Sixty Eight Only
FO-KS LINGAS DINGAPA ASOSA
Ranil Jos
Partner
Authorised Signature
if line.strip() == "Total":
    copy = True
    continue
if line.strip() == "Total:":
    copy = True
    continue
elif line.strip() == "Amount":
    copy = False
    continue
elif copy:
    cnt = cnt + 1
    if cnt == 1:
        Taxable_Value.append(line)
    if cnt == 2:
        Total_Tax.append(line)
    if cnt == 3:
        Total_Amount.append(line)
        break
Try the regex below. It captures the numbers either just before or just after "Total"; take whichever group is non-empty.
((?:[\d. ]+\s){3})?Total:?\s((?:[\d. ]+\s)*)Amount
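To make the pattern easier to read, here is the same regex written with `re.VERBOSE` and one comment per part. This is only an annotated sketch; the short `sample_after` and `sample_before` strings are trimmed-down stand-ins for the two layouts in the question, not extra data.

```python
import re

# Same pattern as above, spread over several lines so each part can be commented.
PATTERN = re.compile(r"""
    ((?:[\d. ]+\s){3})?   # group 1 (optional): three number-like chunks right before Total
    Total:?\s             # the word Total, an optional colon, one whitespace character
    ((?:[\d. ]+\s)*)      # group 2: any number-like chunks right after Total
    Amount                # the word Amount ends the match
""", re.VERBOSE)

# Layout 1: the values follow "Total".
sample_after = "Total\n254285.00\n45771.30\n300056 30\nAmount in word:\n"
# Layout 2: the values precede "Total:".
sample_before = "8473.68\n55549.58\n47076.00\nTotal:\nAmount in words: ...\n"

m1 = PATTERN.search(sample_after)
m2 = PATTERN.search(sample_before)

print(repr(m1.group(1)), repr(m1.group(2)))
print(repr(m2.group(1)), repr(m2.group(2)))
```

In the first layout group 1 does not take part in the match and group 2 holds the three lines; in the second layout it is the other way round, which is why the code prints `o[0] or o[1]`.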
Code
import re
case1="""
thanks in advance I have some text files as input in which data is in some pattern I want to extract the data from these text files. My code is working for many files but it fails at one point so I would like some help
first format where i am extracting data between Total and Words
Total
254285.00
45771.30
300056 30
Amount in word:
Second format where my code fails i want to extract 3 values before Total words
Original/DuplicatesTriplicate TAX INVOICE (Under Rule 46 of the Central Goods & Service Tax Rules, 2017) Page 1 of 1 KS LINGAPPA AND SON Industrial Area, Plot No 14. KSSIDC TBDam Road. Hosapete-
583201 State: Karnataka
State Code: 29
GSTIN: 29AAEFK8072G122 Phone: PAN: AAEFK8072G CIN
Invoice No: OS/20-21/5
Invoice Date: 29/08/2020
Bill To: Recepient Code: GSCPL Recepient Name: GREAT SANDS CONSULTING PRIVATE
LIMITED Address: 70, TUMKUR ROAD,YESHWANTHPUR,BANGALURU(Bangalore) Urban
Karnataka, 560022 GSTIN: 29AAECG5355M1Z3
State: Karnataka PAN: AAECG5355M
State Code: 29 Place of Supply: Karnataka
me Urbane Karnataka, 560022
Reverse Charge Applicable - N
SAC & Description
Total Tax Total Amount
Taxable Val SGST/
UTGST
Rate
SGSTI CGST UTGST Rate Amount
CGST Amount
IGST Rate
IGST Amount
11
998599 & Market Devlopment
47076.00
55549.68
8473.68
9.00
0.00
4236.84
9.00
4236.84
0.00
8473.68
55549.58
47076.00
Total:
Amount in words: Rupees Fifty Five Thousand Five Hundred Fourty Nine & Paisa
Sixty Eight Only
FO-K.S. LINGAS DINGAPA ASOSA
Ranil Jos
Partner
Authorised Signature
"""
output = re.findall(r"((?:[\d. ]+\s){3})?Total:?\s((?:[\d. ]+\s)*)Amount", case1)
for o in output:
    # o[0] holds the values before "Total:", o[1] the values after "Total"
    print(o[0] or o[1])
Output
254285.00
45771.30
300056 30
8473.68
55549.58
47076.00
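One thing to watch: the pattern also matches the column header "Total Amount" in the second sample, with both groups empty, so the loop prints an empty line for it. A minimal sketch of a helper that skips such empty matches and splits each block into separate values is shown below. The name `extract_totals` and the short `text` sample are my own, used only for illustration.

```python
import re

PATTERN = re.compile(r"((?:[\d. ]+\s){3})?Total:?\s((?:[\d. ]+\s)*)Amount")

def extract_totals(text):
    """Return one list of value strings per match that actually captured numbers."""
    results = []
    for before, after in PATTERN.findall(text):
        block = before or after
        values = [line.strip() for line in block.splitlines() if line.strip()]
        if values:  # skip matches such as the "Total Amount" header, which capture nothing
            results.append(values)
    return results

# Hypothetical, shortened input: a header line followed by the second layout.
text = (
    "Total Tax Total Amount\n"
    "8473.68\n55549.58\n47076.00\nTotal:\nAmount in words: ...\n"
)
totals = extract_totals(text)
print(totals)
```

Note that the order of the three values appears to differ between the two layouts (taxable value first in the first sample, tax first in the second), so it may be safer to assign them to named fields per layout rather than by position alone.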
If you are reading the content from a file, use the code below.
import re
with open("test.txt", "r") as f:
    case1 = f.read()
output = re.findall(r"((?:[\d. ]+\s){3})?Total:?\s((?:[\d. ]+\s)*)Amount", case1)
for o in output:
    print(o[0] or o[1])
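If you would rather keep a line-by-line loop like the one in the question, the same idea can be sketched without one big regex: remember the last three number-like lines, and when a "Total" or "Total:" line appears, use the numbers after it if there are any, otherwise the three remembered before it. The function name `totals_from_lines` and the rule for what counts as a number-like line are assumptions of this sketch, not part of the original code.

```python
import re
from collections import deque

# A line counts as "number-like" if it consists only of digits, dots and spaces.
NUMBER_LINE = re.compile(r"[\d. ]+")

def totals_from_lines(lines):
    """Return the values around the first usable 'Total' / 'Total:' line, or []."""
    lines = [line.strip() for line in lines]
    recent = deque(maxlen=3)  # last three consecutive number-like lines
    for i, line in enumerate(lines):
        if line in ("Total", "Total:"):
            # Layout 1: the values follow the marker.
            after = []
            for nxt in lines[i + 1:]:
                if not NUMBER_LINE.fullmatch(nxt):
                    break
                after.append(nxt)
            if after:
                return after[:3]
            # Layout 2: the values precede the marker.
            if len(recent) == 3:
                return list(recent)
        elif NUMBER_LINE.fullmatch(line):
            recent.append(line)
        else:
            recent.clear()  # any other line breaks the run of numbers
    return []

layout1 = ["Total", "254285.00", "45771.30", "300056 30", "Amount in word:"]
layout2 = ["0.00", "8473.68", "55549.58", "47076.00", "Total:", "Amount in words: ..."]
print(totals_from_lines(layout1))
print(totals_from_lines(layout2))
```

With a file, the call could look like `totals_from_lines(open("test.txt"))`, since iterating over a file object yields its lines.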