I want to extract info from a .cdp file (a Content Downloader project file, from a program for parsing; it's plain text and can be opened in Notepad). The file looks like:
....
...
<CD_PARSING_RB_9>0</CD_PARSING_RB_9>
<CD_PARSING_RB_F9_3>0</CD_PARSING_RB_F9_3>
<CD_PARSING_LB_1>http://www.prospect.chisites.net/opportunities/?pageno=1
http://www.prospect.chisites.net/opportunities/?pageno=2
http://www.prospect.chisites.net/opportunities/?pageno=3
http://www.prospect.chisites.net/opportunities/?pageno=4</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>
<CD_PARSING_EDIT_27><a href=/jobs/</CD_PARSING_EDIT_27>
<CD_PARSING_EDIT_28>0</CD_PARSING_EDIT_28>
I want to extract the links using Python. I found a solution, but it only works partially (it just deletes the <CD_PARSING_LB_1> tag); it should delete everything except the links between those two tags. A search-based solution would also be fine, but mine wouldn't work for some reason.
code:

import string
import codecs
import re
import glob

outfile = open('newout.txt', 'w+')
try:
    for file in glob.glob("*.cdp"):
        print(file)
        infile = open(file, 'r')
        step1 = re.sub('.*<CD_PARSING_LB_1>', '', infile.read(), re.DOTALL)
        step2 = re.sub('</CD_PARSING_LB_1>.*', '', step1, re.DOTALL)
        outfile.write(str(step1))
except Exception as ex:
    print ex
    raw_input()
Please help me in any way to get those links separated... Thanks.

Full file example:
Content Downloader X1 (11.9940) project file (parsing)
<F68_CB_5>0</F68_CB_5>
<F68_CB_8>0</F68_CB_8>
<F34_CB_4>0</F34_CB_4>
<F70_CB_4>0</F70_CB_4>
<F34_CB_5>0</F34_CB_5>
<F34_SE_1>0</F34_SE_1>
<F82_SE_2>0</F82_SE_2>
<F69_SE_1>1</F69_SE_1>
<F1_CMBO_8>0</F1_CMBO_8>
<F105_MEMO_1></F105_MEMO_1>
<F9_RBN_01>2</F9_RBN_01>
<F96_RB_01>1</F96_RB_01>
<F1_RBN_15>1</F1_RBN_15>
<F1_N120>1</F1_N120>
<F64_CB_01>0</F64_CB_01>
<F64_RB_01>1</F64_RB_01>
<F70_CB_03>0</F70_CB_03>
<CD_PARSING_COMBO_5>0</CD_PARSING_COMBO_5>
<F64_CB_02>0</F64_CB_02>
<F60_CB_02>0</F60_CB_02>
<F64_RE_1></F64_RE_1>
<F95_M_1></F95_M_1>
<F1_COMBO_6>0</F1_COMBO_6>
<F40_CHCKBX_555>0</F40_CHCKBX_555>
<F09_CB_01>0</F09_CB_01>
<F48_CB_02>0</F48_CB_02>
<F68_CB_01>0</F68_CB_01>
<F68_CB_02>0</F68_CB_02>
<F68_CB_03>0</F68_CB_03>
<F57_CB_41>0</F57_CB_41>
<F57_CB_43>0</F57_CB_43>
<F57_CB_45>0</F57_CB_45>
<F57_CB_47>0</F57_CB_47>
<F57_CB_49>0</F57_CB_49>
<F57_CB_51>0</F57_CB_51>
<F57_CB_53>0</F57_CB_53>
<F57_CB_55>0</F57_CB_55>
<F57_CB_57>0</F57_CB_57>
<F57_CB_59>0</F57_CB_59>
<F57_CB_61>0</F57_CB_61>
<F57_CB_63>0</F57_CB_63>
<F57_CB_65>0</F57_CB_65>
<F57_CB_67>0</F57_CB_67>
<F57_CB_69>0</F57_CB_69>
<F57_CB_71>0</F57_CB_71>
<F57_CB_73>0</F57_CB_73>
<F57_CB_75>0</F57_CB_75>
<F57_CB_77>0</F57_CB_77>
<F57_CB_79>0</F57_CB_79>
<F57_CB_42>0</F57_CB_42>
<F57_CB_44>0</F57_CB_44>
<F57_CB_46>0</F57_CB_46>
<F57_CB_48>0</F57_CB_48>
<F57_CB_50>0</F57_CB_50>
<F57_CB_52>0</F57_CB_52>
<CD_PARSING_EDIT_93>0</CD_PARSING_EDIT_93>
<CD_PARSING_EDIT_94></CD_PARSING_EDIT_94>
<CD_PARSING_EDIT_57_12></CD_PARSING_EDIT_57_12>
<CD_PARSING_EDIT_57_13></CD_PARSING_EDIT_57_13>
<CD_PARSING_EDIT_57_14></CD_PARSING_EDIT_57_14>
<CD_PARSING_EDIT_57_15></CD_PARSING_EDIT_57_15>
<CD_PARSING_EDIT_57_16></CD_PARSING_EDIT_57_16>
<CD_PARSING_EDIT_57_17></CD_PARSING_EDIT_57_17>
<CD_PARSING_EDIT_57_18></CD_PARSING_EDIT_57_18>
<CD_PARSING_RICH_50_1>[VALUE]</CD_PARSING_RICH_50_1>
<CD_PARSING_EDIT_F9_13>3</CD_PARSING_EDIT_F9_13>
<CD_PARSING_EDIT_F9_18>http://sitename.com</CD_PARSING_EDIT_F9_18>
<CD_PARSING_EDIT_F24_2>1</CD_PARSING_EDIT_F24_2>
<CD_PARSING_EDIT_F48_1></CD_PARSING_EDIT_F48_1>
<CD_PARSING_EDIT_F48_2>10</CD_PARSING_EDIT_F48_2>
<CD_PARSING_EDIT_F48_5>0</CD_PARSING_EDIT_F48_5>
<CD_PARSING_EDIT_F48_3>0</CD_PARSING_EDIT_F48_3>
<CD_PARSING_EDIT_F56_1></CD_PARSING_EDIT_F56_1>
<CD_PARSING_EDIT_F56_2>-</CD_PARSING_EDIT_F56_2>
<CD_PARSING_EDIT_F34_1></CD_PARSING_EDIT_F34_1>
<CD_PARSING_EDIT_F34_3>http://</CD_PARSING_EDIT_F34_3>
<CD_PARSING_EDIT_F40_2>Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 sputnik 2.1.0.18 YB/4.3.0</CD_PARSING_EDIT_F40_2>
<CD_PARSING_EDIT_F46_1></CD_PARSING_EDIT_F46_1>
<CD_PARSING_M49_1> class="entry"
id="news-id-
id="article-text"
</CD_PARSING_M49_1>
<CD_PARSING_M48_1></CD_PARSING_M48_1>
<F90_M_1></F90_M_1>
<CD_PARSING_M48_3></CD_PARSING_M48_3>
<CD_PARSING_SYN_F46_1><CD_CYCLE_GRAN_ALL!></CD_PARSING_SYN_F46_1>
<CD_PARSING_RICH_F9_1></CD_PARSING_RICH_F9_1>
<CD_PARSING_RICH_F9_2></CD_PARSING_RICH_F9_2>
<CD_PARSING_R24_1>0</CD_PARSING_R24_1>
<F1_COMBOBOX_9>0</F1_COMBOBOX_9>
<F1_COMBOBOX_10>2</F1_COMBOBOX_10>
<CD_PARSING_RB_9>0</CD_PARSING_RB_9>
<CD_PARSING_RB_F9_3>0</CD_PARSING_RB_F9_3>
<CD_PARSING_LB_1>http://www.latestvacancies.com/wates/</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>
<CD_PARSING_EDIT_27>Jobs/Advert/</CD_PARSING_EDIT_27>
<CD_PARSING_EDIT_28>0</CD_PARSING_EDIT_28>
<CD_PARSING_EDIT_29>?</CD_PARSING_EDIT_29>
<CD_PARSING_COMBOBOX_1>csv</CD_PARSING_COMBOBOX_1>
<CD_PARSING_RE61_1></CD_PARSING_RE61_1>
<CD_PARSING_CHECK_61_1>1</CD_PARSING_CHECK_61_1>
<CD_PARSING_RB60_1>1</CD_PARSING_RB60_1>
<CD_PARSING_SE60_1>1</CD_PARSING_SE60_1>
Try this. Use the with statement to read and write files. Also, file is a builtin class name, so use something like ifile instead. The pattern http:[^<]* matches each link, stopping before the closing tag.
import string
import codecs
import re
import glob

with open('newout.txt', 'w+') as outfile:
    try:
        for ifile in glob.glob("*.cdp"):
            print(ifile)
            with open(ifile, 'r') as infile:
                for line in infile:
                    step1 = re.findall(r'(http:[^<]+)', line)
                    if len(step1) > 0:
                        outfile.write("%s\n" % step1[0].strip())
    except Exception as ex:
        print(ex)
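For reference, a minimal sketch of what re.findall returns for one of the link lines from the sample file above (the line content is taken from the question; the pattern is the one used in the answer):

```python
import re

# One line from the sample .cdp file in the question
line = "<CD_PARSING_LB_1>http://www.prospect.chisites.net/opportunities/?pageno=1"

# The pattern grabs everything from "http:" up to (but not including) the next "<"
links = re.findall(r'(http:[^<]+)', line)
print(links)  # ['http://www.prospect.chisites.net/opportunities/?pageno=1']
```

Because the pattern stops at "<", it also works on the last line of the block, where the link is immediately followed by </CD_PARSING_LB_1>.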
outfile = open('newout.txt', 'w+')
try:
    for file in glob.glob("*.cdp"):
        print(file)
        infile = open(file, 'r')
        step1 = re.sub(re.compile('.*[<]CD_PARSING_LB_1[>]', re.DOTALL), '', infile.read())
        step2 = re.sub(re.compile('[<]/CD_PARSING_LB_1[>].*', re.DOTALL), '', step1)
        outfile.write(str(step2))
except Exception as ex:
    print ex
    raw_input()
Try this. The fourth argument of re.sub is count, not flags, which is why your version didn't work.

Also, I think using result = re.search('[<]tag1[>](.*)[<]/tag1[>]', text) and getting the links with result.group(1) might be easier.
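A rough sketch of that re.search approach, using a shortened stand-in for the .cdp content (example.com URLs are placeholders, not from the real file); re.DOTALL is needed because the links span multiple lines:

```python
import re

# Shortened stand-in for a .cdp file's content
text = """<CD_PARSING_RB_9>0</CD_PARSING_RB_9>
<CD_PARSING_LB_1>http://example.com/?pageno=1
http://example.com/?pageno=2</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>"""

# DOTALL lets "." cross newlines, so the whole block between the tags is captured
result = re.search('[<]CD_PARSING_LB_1[>](.*)[<]/CD_PARSING_LB_1[>]', text, re.DOTALL)
if result:
    for link in result.group(1).splitlines():
        print(link.strip())
```

This prints one link per line; with only one <CD_PARSING_LB_1> block per file, the greedy (.*) is safe.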
Use this regex pattern:

String pattern = "(http:.*=\\d{1,7})";

See the demo here: https://regex101.com/r/fI3eT4/1
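The pattern above is written as a Java string; the same idea in Python looks like this (note the quantifier must be written {1,7} with no space, and this pattern only matches links that end in a digit, e.g. the ?pageno=N URLs):

```python
import re

# A link line from the sample file; the pattern requires "=<digits>" at the end
line = "http://www.prospect.chisites.net/opportunities/?pageno=3"
m = re.search(r'(http:.*=\d{1,7})', line)
if m:
    print(m.group(1))  # http://www.prospect.chisites.net/opportunities/?pageno=3
```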