I want to extract info from a .cdp file (a Content Downloader project file, from a program for parsing; it's plain text and can be opened in Notepad). The file looks like:
....
...
<CD_PARSING_RB_9>0</CD_PARSING_RB_9>
<CD_PARSING_RB_F9_3>0</CD_PARSING_RB_F9_3>
<CD_PARSING_LB_1>http://www.prospect.chisites.net/opportunities/?pageno=1
http://www.prospect.chisites.net/opportunities/?pageno=2
http://www.prospect.chisites.net/opportunities/?pageno=3
http://www.prospect.chisites.net/opportunities/?pageno=4</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>
<CD_PARSING_EDIT_27><a href=/jobs/</CD_PARSING_EDIT_27>
<CD_PARSING_EDIT_28>0</CD_PARSING_EDIT_28>
I want to extract the links using Python. I found a solution, but it only works partially (it just deletes the <CD_PARSING_LB_1> tag); it should delete everything except the links between those two tags. A search-based solution would also be fine, but mine wouldn't work for some reason.
code:

import string
import codecs
import re
import glob

outfile = open('newout.txt', 'w+')
try:
    for file in glob.glob("*.cdp"):
        print(file)
        infile = open(file, 'r')
        step1 = re.sub('.*<CD_PARSING_LB_1>', '', infile.read(), re.DOTALL)
        step2 = re.sub('</CD_PARSING_LB_1>.*', '', step1, re.DOTALL)
        outfile.write(str(step1))
except Exception as ex:
    print ex
    raw_input()
Please help me in any way to get those links separated... Thanks.

Full file example:
Content Downloader X1 (11.9940) project file (parsing)
<F68_CB_5>0</F68_CB_5>
<F68_CB_8>0</F68_CB_8>
<F34_CB_4>0</F34_CB_4>
<F70_CB_4>0</F70_CB_4>
<F34_CB_5>0</F34_CB_5>
<F34_SE_1>0</F34_SE_1>
<F82_SE_2>0</F82_SE_2>
<F69_SE_1>1</F69_SE_1>
<F1_CMBO_8>0</F1_CMBO_8>
<F105_MEMO_1></F105_MEMO_1>
<F9_RBN_01>2</F9_RBN_01>
<F96_RB_01>1</F96_RB_01>
<F1_RBN_15>1</F1_RBN_15>
<F1_N120>1</F1_N120>
<F64_CB_01>0</F64_CB_01>
<F64_RB_01>1</F64_RB_01>
<F70_CB_03>0</F70_CB_03>
<CD_PARSING_COMBO_5>0</CD_PARSING_COMBO_5>
<F64_CB_02>0</F64_CB_02>
<F60_CB_02>0</F60_CB_02>
<F64_RE_1></F64_RE_1>
<F95_M_1></F95_M_1>
<F1_COMBO_6>0</F1_COMBO_6>
<F40_CHCKBX_555>0</F40_CHCKBX_555>
<F09_CB_01>0</F09_CB_01>
<F48_CB_02>0</F48_CB_02>
<F68_CB_01>0</F68_CB_01>
<F68_CB_02>0</F68_CB_02>
<F68_CB_03>0</F68_CB_03>
<F57_CB_41>0</F57_CB_41>
<F57_CB_43>0</F57_CB_43>
<F57_CB_45>0</F57_CB_45>
<F57_CB_47>0</F57_CB_47>
<F57_CB_49>0</F57_CB_49>
<F57_CB_51>0</F57_CB_51>
<F57_CB_53>0</F57_CB_53>
<F57_CB_55>0</F57_CB_55>
<F57_CB_57>0</F57_CB_57>
<F57_CB_59>0</F57_CB_59>
<F57_CB_61>0</F57_CB_61>
<F57_CB_63>0</F57_CB_63>
<F57_CB_65>0</F57_CB_65>
<F57_CB_67>0</F57_CB_67>
<F57_CB_69>0</F57_CB_69>
<F57_CB_71>0</F57_CB_71>
<F57_CB_73>0</F57_CB_73>
<F57_CB_75>0</F57_CB_75>
<F57_CB_77>0</F57_CB_77>
<F57_CB_79>0</F57_CB_79>
<F57_CB_42>0</F57_CB_42>
<F57_CB_44>0</F57_CB_44>
<F57_CB_46>0</F57_CB_46>
<F57_CB_48>0</F57_CB_48>
<F57_CB_50>0</F57_CB_50>
<F57_CB_52>0</F57_CB_52>
<CD_PARSING_EDIT_93>0</CD_PARSING_EDIT_93>
<CD_PARSING_EDIT_94></CD_PARSING_EDIT_94>
<CD_PARSING_EDIT_57_12></CD_PARSING_EDIT_57_12>
<CD_PARSING_EDIT_57_13></CD_PARSING_EDIT_57_13>
<CD_PARSING_EDIT_57_14></CD_PARSING_EDIT_57_14>
<CD_PARSING_EDIT_57_15></CD_PARSING_EDIT_57_15>
<CD_PARSING_EDIT_57_16></CD_PARSING_EDIT_57_16>
<CD_PARSING_EDIT_57_17></CD_PARSING_EDIT_57_17>
<CD_PARSING_EDIT_57_18></CD_PARSING_EDIT_57_18>
<CD_PARSING_RICH_50_1>[VALUE]</CD_PARSING_RICH_50_1>
<CD_PARSING_EDIT_F9_13>3</CD_PARSING_EDIT_F9_13>
<CD_PARSING_EDIT_F9_18>http://sitename.com</CD_PARSING_EDIT_F9_18>
<CD_PARSING_EDIT_F24_2>1</CD_PARSING_EDIT_F24_2>
<CD_PARSING_EDIT_F48_1></CD_PARSING_EDIT_F48_1>
<CD_PARSING_EDIT_F48_2>10</CD_PARSING_EDIT_F48_2>
<CD_PARSING_EDIT_F48_5>0</CD_PARSING_EDIT_F48_5>
<CD_PARSING_EDIT_F48_3>0</CD_PARSING_EDIT_F48_3>
<CD_PARSING_EDIT_F56_1></CD_PARSING_EDIT_F56_1>
<CD_PARSING_EDIT_F56_2>-</CD_PARSING_EDIT_F56_2>
<CD_PARSING_EDIT_F34_1></CD_PARSING_EDIT_F34_1>
<CD_PARSING_EDIT_F34_3>http://</CD_PARSING_EDIT_F34_3>
<CD_PARSING_EDIT_F40_2>Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 sputnik 2.1.0.18 YB/4.3.0</CD_PARSING_EDIT_F40_2>
<CD_PARSING_EDIT_F46_1></CD_PARSING_EDIT_F46_1>
<CD_PARSING_M49_1> class="entry"
id="news-id-
id="article-text"
</CD_PARSING_M49_1>
<CD_PARSING_M48_1></CD_PARSING_M48_1>
<F90_M_1></F90_M_1>
<CD_PARSING_M48_3></CD_PARSING_M48_3>
<CD_PARSING_SYN_F46_1><CD_CYCLE_GRAN_ALL!></CD_PARSING_SYN_F46_1>
<CD_PARSING_RICH_F9_1></CD_PARSING_RICH_F9_1>
<CD_PARSING_RICH_F9_2></CD_PARSING_RICH_F9_2>
<CD_PARSING_R24_1>0</CD_PARSING_R24_1>
<F1_COMBOBOX_9>0</F1_COMBOBOX_9>
<F1_COMBOBOX_10>2</F1_COMBOBOX_10>
<CD_PARSING_RB_9>0</CD_PARSING_RB_9>
<CD_PARSING_RB_F9_3>0</CD_PARSING_RB_F9_3>
<CD_PARSING_LB_1>http://www.latestvacancies.com/wates/</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>
<CD_PARSING_EDIT_27>Jobs/Advert/</CD_PARSING_EDIT_27>
<CD_PARSING_EDIT_28>0</CD_PARSING_EDIT_28>
<CD_PARSING_EDIT_29>?</CD_PARSING_EDIT_29>
<CD_PARSING_COMBOBOX_1>csv</CD_PARSING_COMBOBOX_1>
<CD_PARSING_RE61_1></CD_PARSING_RE61_1>
<CD_PARSING_CHECK_61_1>1</CD_PARSING_CHECK_61_1>
<CD_PARSING_RB60_1>1</CD_PARSING_RB60_1>
<CD_PARSING_SE60_1>1</CD_PARSING_SE60_1>
Try this. Use the with statement to read and write files. Also, file is a builtin class name, so use something like ifile instead. The pattern http:[^<]* matches each link, stopping before the closing tag.
import string
import codecs
import re
import glob

with open('newout.txt', 'w+') as outfile:
    try:
        for ifile in glob.glob("*.cdp"):
            print(ifile)
            with open(ifile, 'r') as infile:
                for line in infile:
                    step1 = re.findall(r'(http:[^<]+)', line)
                    if len(step1) > 0:
                        outfile.write("%s\n" % step1[0].strip())
    except Exception as ex:
        print(ex)
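For reference, a minimal sketch of what re.findall returns for one of the link lines from the sample file above (the line content is taken from the question; the pattern is the one used in the answer):

```python
import re

# One line from the sample .cdp file in the question
line = "<CD_PARSING_LB_1>http://www.prospect.chisites.net/opportunities/?pageno=1"

# The pattern grabs everything from "http:" up to (but not including) the next "<"
links = re.findall(r'(http:[^<]+)', line)
print(links)  # ['http://www.prospect.chisites.net/opportunities/?pageno=1']
```

Because the pattern stops at "<", it also works on the last line of the block, where the link is immediately followed by </CD_PARSING_LB_1>.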
outfile = open('newout.txt', 'w+')
try:
    for file in glob.glob("*.cdp"):
        print(file)
        infile = open(file, 'r')
        step1 = re.sub(re.compile('.*[<]CD_PARSING_LB_1[>]', re.DOTALL), '', infile.read())
        step2 = re.sub(re.compile('[<]/CD_PARSING_LB_1[>].*', re.DOTALL), '', step1)
        outfile.write(str(step2))
except Exception as ex:
    print ex
    raw_input()
Try this. The fourth argument of re.sub is count, not flags, which is why your version didn't work.

Also, I think using result = re.search('[<]tag1[>](.*)[<]/tag1[>]', text) and getting the links with result.group(1) might be easier.
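A rough sketch of that re.search approach, using a shortened stand-in for the .cdp content (example.com URLs are placeholders, not from the real file); re.DOTALL is needed because the links span multiple lines:

```python
import re

# Shortened stand-in for a .cdp file's content
text = """<CD_PARSING_RB_9>0</CD_PARSING_RB_9>
<CD_PARSING_LB_1>http://example.com/?pageno=1
http://example.com/?pageno=2</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>"""

# DOTALL lets "." cross newlines, so the whole block between the tags is captured
result = re.search('[<]CD_PARSING_LB_1[>](.*)[<]/CD_PARSING_LB_1[>]', text, re.DOTALL)
if result:
    for link in result.group(1).splitlines():
        print(link.strip())
```

This prints one link per line; with only one <CD_PARSING_LB_1> block per file, the greedy (.*) is safe.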
Use this regex pattern:

String pattern = "(http:.*=\\d{1,7})";

See the demo here: https://regex101.com/r/fI3eT4/1
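The pattern above is written as a Java string; the same idea in Python looks like this (note the quantifier must be written {1,7} with no space, and this pattern only matches links that end in a digit, e.g. the ?pageno=N URLs):

```python
import re

# A link line from the sample file; the pattern requires "=<digits>" at the end
line = "http://www.prospect.chisites.net/opportunities/?pageno=3"
m = re.search(r'(http:.*=\d{1,7})', line)
if m:
    print(m.group(1))  # http://www.prospect.chisites.net/opportunities/?pageno=3
```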