
How can I extract the following values, which are separated by varying numbers of spaces, from this .txt file?

I am trying to extract relevant information from this .txt file. There are many such .txt files to process, so I need a function, a class, or some other automated approach. This is the .txt file I am working with:

NET CASH                      3575.50    NET CASH                      3575.50    CASH SALES                    3575.50
LESS COMMISSIONS               691.03    NET CREDIT                   13429.59    CASH REFUNDS                      .00
                                         TOTAL                        17005.09    NET ADJUSTMENTS                   .00
NET REMITTANCE                2884.47                                             AAD S                             .00
                                         FARES                        12442.61    NET CASH                      3575.50
                                         TOTAL                        17005.09    CREDIT REFUNDS                    .00

My goal is to extract primarily the NET CASH, CASH SALES, LESS COMMISSIONS, NET CREDIT, TOTAL, and NET REMITTANCE values. How should I approach this? I have noticed that each value ends in .(digit)(digit) and is followed by a run of spaces.

This is the code I am using so far to create this .txt file:

# Labels whose lines should be copied to the output file
labels = ('NET CASH', 'TOTAL', 'CASH SALES', 'LESS COMMISSIONS', 'NET REMITTANCE')

with open('MasterFile.txt', 'r') as file1, open('StrippedMasterFileUpdated.txt', 'w') as file2:
    for line in file1:
        # Keep the line if it mentions any of the labels
        if any(label in line for label in labels):
            file2.write(line)

Here is the expected output:

3575.50\n 3575.50\n 691.03\n 13429.59\n 13429.59\n 17005.09\n 2884.47

Anything helps!

I'd use regex to solve this.

First, collapse each run of whitespace in the line into a single space:

import re

line = re.sub(r"\s+", " ", line)
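
As a quick illustration of that step, here is a small sketch that applies the substitution to the first line of the sample file. The variable names are only for illustration, and the column spacing is shortened here; any run of whitespace behaves the same way.

```python
import re

# First line of the sample file from the question (column spacing shortened)
line = "NET CASH          3575.50    NET CASH        3575.50    CASH SALES      3575.50"

# Replace every run of whitespace with a single space
collapsed = re.sub(r"\s+", " ", line)
print(collapsed)  # NET CASH 3575.50 NET CASH 3575.50 CASH SALES 3575.50
```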

Next, search for the labels you want and the number that follows each one:

regex = r"(NET CASH|CASH SALES|LESS COMMISSIONS|NET CREDIT|TOTAL|NET REMITTANCE) ([0-9]+\.[0-9]+)"
results = re.findall(regex, line)

results then contains a list of (label, value) tuples, one per match. Running this on the first line of your example gives:

print(results)  # Prints -> [('NET CASH', '3575.50'), ('NET CASH', '3575.50'), ('CASH SALES', '3575.50')]

Then you can process this data as needed.
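
Since the question asks for something reusable across many files, the two steps could be wrapped in a function along these lines. This is only a sketch: the name extract_values and the LABELS tuple are my own, and it takes any iterable of lines so it can be tried on a couple of sample lines without touching the disk.

```python
import re

# Labels of interest, taken from the question (plain text, no regex metacharacters)
LABELS = ("NET CASH", "CASH SALES", "LESS COMMISSIONS",
          "NET CREDIT", "TOTAL", "NET REMITTANCE")

# Any one label, a single space, then a number such as 3575.50
PATTERN = re.compile(r"(" + "|".join(LABELS) + r") ([0-9]+\.[0-9]+)")


def extract_values(lines):
    """Return (label, value) pairs found in an iterable of text lines."""
    pairs = []
    for line in lines:
        # Collapse runs of whitespace so label and number are one space apart
        collapsed = re.sub(r"\s+", " ", line)
        pairs.extend(PATTERN.findall(collapsed))
    return pairs


# First two lines of the sample file from the question
sample = [
    "NET CASH                      3575.50    NET CASH                      3575.50    CASH SALES                    3575.50",
    "LESS COMMISSIONS               691.03    NET CREDIT                   13429.59    CASH REFUNDS                      .00",
]

pairs = extract_values(sample)
for label, value in pairs:
    print(label, value)
```

A file object is also an iterable of lines, so for a real file something like "with open('MasterFile.txt') as f: pairs = extract_values(f)" should work, and the same call can be repeated in a loop over all the .txt files. Values are returned as strings; convert them with float() if you need numbers.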
