I am trying to extract the relevant information from this .txt file. There are many .txt files to process, so I need a function, a class, or some other automated approach. This is the .txt file I am working with:
NET CASH 3575.50 NET CASH 3575.50 CASH SALES 3575.50
LESS COMMISSIONS 691.03 NET CREDIT 13429.59 CASH REFUNDS .00
TOTAL 17005.09 NET ADJUSTMENTS .00
NET REMITTANCE 2884.47 AAD S .00
FARES 12442.61 NET CASH 3575.50
TOTAL 17005.09 CREDIT REFUNDS .00
My goal is to extract primarily the NET CASH, CASH SALES, LESS COMMISSIONS, NET CREDIT, TOTAL, and NET REMITTANCE values. How should I approach this? I have noticed that each value ends with .(digit)(digit) and is followed by a variable number of spaces.
This is the code I am using so far to create this .txt file:
keywords = ('NET CASH', 'CASH SALES', 'LESS COMMISSIONS', 'NET CREDIT', 'TOTAL', 'NET REMITTANCE')
# Copy every line that mentions at least one keyword to the output file
with open('MasterFile.txt', 'r') as file1, open('StrippedMasterFileUpdated.txt', 'w') as file2:
    for line in file1:
        if any(keyword in line for keyword in keywords):
            file2.write(line)
Here is the expected output:
3575.50\n 3575.50\n 691.03\n 13429.59\n 13429.59\n 17005.09\n 2884.47
Anything helps!
I'd use regex to solve this.
First, collapse every run of whitespace in each line into a single space:
import re
line = re.sub(r"\s+", " ", line)
Next, search for the desired labels and the number that follows each one (the space sits outside the second group so the captured number has no leading space):
regex = r"(NET CASH|CASH SALES|LESS COMMISSIONS|NET CREDIT|TOTAL|NET REMITTANCE) ([0-9]+\.[0-9]+)"
results = re.findall(regex, line)
results is then a list of (label, value) tuples, one per match. Running this on the first line of your example gives:
print(results) # Prints -> [('NET CASH', '3575.50'), ('NET CASH', '3575.50'), ('CASH SALES', '3575.50')]
Then you can process this data as needed.
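Since you mentioned having many files, here is a minimal sketch of how the two steps above could be wrapped into reusable functions. The names extract_amounts, strip_file and strip_folder are my own, as is the choice to write one value per line. I also assumed amounts always have two decimals and may lack a leading digit (like .00), which is why the pattern differs slightly from the one above.

```python
import re
from pathlib import Path

# Labels taken from the question; adjust as needed.
LABELS = ("NET CASH", "CASH SALES", "LESS COMMISSIONS",
          "NET CREDIT", "TOTAL", "NET REMITTANCE")

# A label, one space, then an amount such as 3575.50 or .00
# (assumption: amounts always have exactly two decimals).
PATTERN = re.compile(r"(" + "|".join(LABELS) + r") (\d*\.\d{2})")


def extract_amounts(text):
    """Return (label, amount) pairs in the order they appear in text."""
    pairs = []
    for line in text.splitlines():
        # Collapse runs of whitespace so one space separates the tokens.
        line = re.sub(r"\s+", " ", line)
        pairs.extend(PATTERN.findall(line))
    return pairs


def strip_file(src, dst):
    """Write the amounts found in src to dst, one per line."""
    text = Path(src).read_text()
    amounts = [amount for _, amount in extract_amounts(text)]
    Path(dst).write_text("\n".join(amounts) + "\n")


def strip_folder(in_dir, out_dir):
    """Run strip_file on every .txt file in in_dir (hypothetical layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.txt")):
        strip_file(src, out_dir / src.name)
```

Calling strip_file('MasterFile.txt', 'StrippedMasterFileUpdated.txt') would then replace the loop from the question. Writing the results to a separate output folder keeps a second run from picking up its own output files.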