
Tips on parsing a custom file format in Python

I developed a custom system that simulates web activity, such as downloading files. It is driven by a custom file format. I want to replace this old system, which is written in Perl, with a newer one in Python, but first I have to parse the file.

There are certain parts of the file that I would like to parse, such as the [settings] section, which holds any arguments for the system. There is also a [macro] section, which is where the important content (the steps, etc.) begins.

What I am having trouble with is parsing these sections and having my system write them out in a different, much simpler format. I have thousands of these files, so I just want to write a generator that takes an old file and writes the new format to a new file.

Old format:

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0

[macro]
%::WebSurfRules =
    (
    'Step1' =>
        {
        action                  => 'NAVIGATE',
        inputstring             => 'http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',
        },
    'Step2' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'FirstClick\'\}',
        pass          => 'phHttpDest->\{\'Step2Pass\'\}',
        },
    'Step3' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'SecondClick\'\}',
        },
    'Step4' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'OUTER',
        matchstring   => 'phHttpDest->\{\'DealClick\'\}',
        accept_multi_match  => 'ANY_TOP_FIRST',
        },
    'Step5' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'INNER',
        matchstring   => 'phHttpDest->\{\'LinkClick2\'\}',
        fail          => 'Step6',
    #    accept_multi_match  => 'ANY_TOP_LAST',
        },
    'Step6' =>
        {
        action        => 'CLICK_REFERENCE',
        matchtype     => 'INNER',
        matchstring   => 'phHttpDest->\{\'DocClick\'\}',
        },
    'Step7' =>
        {
        action                  => 'CLICK_DOWNLOAD_OK',
        },
    );

[data]
Print WebAddress______________  Destination_________________________________________________ FirstClick_________________ SecondClick________________    DealClick_________________________   LinkClick2________________________  DocClick___________________________________ PayInterval   DueDay  Step2Pass__________     QaRule_________________________________________________________________________________________________________________
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf              Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Fund´s Allocation                           q1                    Step3                   qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}

And what I want it to produce:

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
    (
    '1'     => 'NAVIGATE,phHttpDest->\{\'WebAddress\'\}', 
    '2'     => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',                                                         
    '3'     => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',                                 
    '4'     => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
    '5'     => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',                     
    '6'     => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',           
    );

[data]
Print WebAddress______________  Destination_________________________________________________ FirstClick_________________ SecondClick________________    DealClick_________________________   LinkClick2________________________  DocClick___________________________________ PayInterval   DueDay  Step2Pass__________     QaRule_________________________________________________________________________________________________________________
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf              Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Fund´s Allocation                           q1                    Step3                   qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0     http://www.tda-sgft.com/  d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf             Mortgage Loan               ABS                            Caixa Penedes 1 TDA                  MAINPAGE - FAIL                     Investors information on Payment Date       q1                    Step3                   qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}

Here, the phHttpDest key of each click, and of the action, corresponds to a heading in the [data] section.
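To make that correspondence concrete, here is a minimal sketch of how the [data] headings could be read in Python. It assumes the columns are fixed-width and that each heading (padded with underscores) marks where its column starts; the function names column_spans and parse_row are made up for illustration and are not part of the existing system.

```python
import re


def column_spans(header):
    """Return (name, start, end) for each heading in a [data] header line.

    Assumption: each column runs from the start of its heading to the
    start of the next heading; the last column runs to the end of the line.
    """
    matches = list(re.finditer(r"\S+", header))
    spans = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else None
        # Trailing underscores are only padding, so strip them from the name.
        spans.append((m.group().rstrip("_"), m.start(), end))
    return spans


def parse_row(header, row):
    """Slice one data row into a dict keyed by heading name."""
    return {name: row[start:end].strip()
            for name, start, end in column_spans(header)}
```

With a row parsed this way, a key such as FirstClick in phHttpDest->\{\'FirstClick\'\} can be looked up directly in the resulting dict. If the real files are not strictly aligned to the headings, the slicing would need a different rule.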

One way of doing it is to use a series of regular expression replacements to create the files in the new format. I didn't completely understand the rules of your format, so I implemented the general idea, and the result differs from your target in places: in the output below, step 1 keeps the literal URL, step 7 is still present, and some lines end in a doubled ',', sequence. You'll have to go in and make some adjustments to fine-tune it. The output.txt shown below is what gets produced when your example is used as input.txt.

code

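import re

# Read the whole old-format file into one string.
with open('input.txt') as f:
    data = f.read()

# Collapse "'StepN' => { action => " into the new "'N'     => " prefix.
data = re.sub(r"    'Step([0-9]+)' =>\s+\{\s+action\s+=> ", r"    '\1'     => ", data)
# Drop unused fields. Each pattern starts at the closing quote of the
# preceding value, so that quote is removed along with the field.
data = re.sub(r"',\s+pass\s+[^,]+,", "", data)
data = re.sub(r"',\s+accept_multi_match\s+[^,]+,", "", data)
data = re.sub(r"\n +#.*\n", "\n", data)  # remove commented-out lines
data = re.sub(r"',\s+fail\s+[^,]+,", "", data)
data = re.sub(r"',\s+matchtype\s+[^,]+,", "", data)
# Join the action and its target with a comma.
data = re.sub(r"',\s+inputstring\s+=> '", ",", data)
data = re.sub(r"\s+matchstring\s+=> '", ",", data)
# Replace each block's closing "}," with "',". Where the last value still
# has its own closing quote, this yields a doubled "',',".
data = re.sub(r"\n        },", "',", data)

with open('output.txt', 'w') as f:
    f.write(data)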

output.txt

[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0

[macro]
%::WebSurfRules =
    (
    '1'     => 'NAVIGATE,http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',',
    '2'     => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',
    '3'     => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',',
    '4'     => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
    '5'     => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',
    '6'     => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',',
    '7'     => 'CLICK_DOWNLOAD_OK',',
    );

...
