I developed a custom system that simulates web activity, such as downloading files. I also have a custom file format that feeds into this system. I want to replace the old system, which is written in Perl, with a newer one in Python, but first I have to parse the file.
There are certain sections in the file that I would like to parse, such as the [settings]
section, which holds the arguments for the system. I also have a [macro]
section, which is the beginning of the important stuff (the steps, etc.).
What I am having trouble with is parsing these sections and having my system write them out in a different, much simpler format (I have thousands of these files, and I just want to write a generator that takes the old file and writes the new format to a new file).
Old format:
[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
(
'Step1' =>
{
action => 'NAVIGATE',
inputstring => 'http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',
},
'Step2' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'OUTER',
matchstring => 'phHttpDest->\{\'FirstClick\'\}',
pass => 'phHttpDest->\{\'Step2Pass\'\}',
},
'Step3' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'OUTER',
matchstring => 'phHttpDest->\{\'SecondClick\'\}',
},
'Step4' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'OUTER',
matchstring => 'phHttpDest->\{\'DealClick\'\}',
accept_multi_match => 'ANY_TOP_FIRST',
},
'Step5' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'INNER',
matchstring => 'phHttpDest->\{\'LinkClick2\'\}',
fail => 'Step6',
# accept_multi_match => 'ANY_TOP_LAST',
},
'Step6' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'INNER',
matchstring => 'phHttpDest->\{\'DocClick\'\}',
},
'Step7' =>
{
action => 'CLICK_DOWNLOAD_OK',
},
);
[data]
Print WebAddress______________ Destination_________________________________________________ FirstClick_________________ SecondClick________________ DealClick_________________________ LinkClick2________________________ DocClick___________________________________ PayInterval DueDay Step2Pass__________ QaRule_________________________________________________________________________________________________________________
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Fund´s Allocation q1 Step3 qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
And this is what I want it to produce:
[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
(
'1' => 'NAVIGATE,phHttpDest->\{\'WebAddress\'\}',
'2' => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',
'3' => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',
'4' => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
'5' => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',
'6' => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',
);
[data]
Print WebAddress______________ Destination_________________________________________________ FirstClick_________________ SecondClick________________ DealClick_________________________ LinkClick2________________________ DocClick___________________________________ PayInterval DueDay Step2Pass__________ QaRule_________________________________________________________________________________________________________________
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Fund´s Allocation q1 Step3 qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
Here, each step's action and its phHttpDest key (FirstClick, SecondClick, and so on) correlate to the column headings of the [data]
section.
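To make the section layout described above concrete, here is a minimal sketch of splitting such a file into its [settings], [macro] and [data] parts before any rewriting. The function names split_sections and parse_settings are hypothetical, and the sketch assumes every section header sits alone on its own line. It does not rely on configparser, because the [macro] and [data] bodies are not key=value lines.

```python
import re


def split_sections(text):
    """Split text into {section_name: body}, keyed by "[name]" header lines."""
    sections = {}
    current = None
    for line in text.splitlines():
        # Assumption: a header is a line containing only "[word]".
        header = re.fullmatch(r"\[(\w+)\]", line.strip())
        if header:
            current = header.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}


def parse_settings(body):
    """Turn "key=value" lines into a dict of strings."""
    settings = {}
    for line in body.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


# Shortened stand-in for one of the real files.
sample = """[settings]
email_to=people
crc_recheck=0
[macro]
%::WebSurfRules =
(
);
[data]
Print WebAddress
"""

sections = split_sections(sample)
print(list(sections))
print(parse_settings(sections["settings"]))
```

Only the [macro] body would then need rewriting; the [settings] and [data] bodies could be written back unchanged.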
One way of doing it is to use a set of regular-expression replacements to create the files in the new format. I didn't completely understand the rules of your format, so I implemented the general idea, and the result differs in places from your desired output. You'll have to go in and make some adjustments to fine-tune it. The output.txt file is what gets produced when your example is used as input.txt.
code
import re
with open('input.txt') as f:
    data = f.read()
data = re.sub(r" 'Step([0-9]+)' =>\s+{\s+action\s+=> ", r" '\1' => ", data)
data = re.sub(r"',\s+pass\s+[^,]+,", "", data)
data = re.sub(r"',\s+accept_multi_match\s+[^,]+,", "", data)
data = re.sub(r"\n +#.*\n", "\n", data)
data = re.sub(r"',\s+fail\s+[^,]+,", "", data)
data = re.sub(r"',\s+matchtype\s+[^,]+,", "", data)
data = re.sub(r"',\s+inputstring\s+=> '", ",", data)
data = re.sub(r"\s+matchstring\s+=> '", ",", data)
data = re.sub(r"\n },", "',", data)
with open('output.txt', 'w') as f:
    f.write(data)
output.txt
[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
(
'1' => 'NAVIGATE,http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',',
'2' => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',
'3' => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',',
'4' => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
'5' => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',
'6' => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',',
'7' => 'CLICK_DOWNLOAD_OK',',
);
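As an alternative to chained substitutions, the steps could be parsed line by line into dictionaries and then written out again, which avoids stray quote characters in the result. This is only a sketch under assumptions: parse_steps and format_steps are hypothetical names, each field is assumed to sit on its own line in the form name => 'value', and the output keeps the URL of Step1 and the final CLICK_DOWNLOAD_OK step rather than applying the extra rules implied by the desired output.

```python
import re

STEP_START = re.compile(r"'Step(\d+)'\s*=>")
FIELD = re.compile(r"(\w+)\s*=>\s*'(.*)',?\s*$")


def parse_steps(macro_body):
    """Return {step_number: {field_name: value}} from the [macro] body."""
    steps = {}
    current = None
    for raw in macro_body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out fields
        start = STEP_START.match(line)
        if start:
            current = steps.setdefault(int(start.group(1)), {})
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group(1)] = field.group(2)
    return steps


def format_steps(steps):
    """Write each step as 'N' => 'ACTION,target', in numeric order."""
    lines = ["%::WebSurfRules =", "("]
    for number in sorted(steps):
        fields = steps[number]
        target = fields.get("inputstring") or fields.get("matchstring")
        value = fields["action"] if target is None else f"{fields['action']},{target}"
        lines.append(f"'{number}' => '{value}',")
    lines.append(");")
    return "\n".join(lines)


# Shortened copy of the [macro] body from the question.
sample = r"""%::WebSurfRules =
(
'Step1' =>
{
action => 'NAVIGATE',
inputstring => 'http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',
},
'Step2' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'OUTER',
matchstring => 'phHttpDest->\{\'FirstClick\'\}',
pass => 'phHttpDest->\{\'Step2Pass\'\}',
},
'Step7' =>
{
action => 'CLICK_DOWNLOAD_OK',
},
);"""

print(format_steps(parse_steps(sample)))
```

Because every field ends up in a dictionary, rules such as dropping a step or swapping the URL for a phHttpDest reference can be added in format_steps without touching the parsing.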
...