Repeating delimiters and extracting the string between those

Question

I am fairly new to Python and to regular expressions, and I am trying to extract information from an HTML file.

Assume the following is a line in the HTML file (since HTML doesn't "see" whitespace, the example is all on one line):

<td (some possible parameters)> EXTRACT_THIS </td> <td (some possible parameters)> ALSO_EXTRACT_THIS </td>

In my current code:

import re

with open(myInput, 'r') as inputFile:
    for line in inputFile:
        line = line.strip()

        if line != '':
            # Raw string so backslashes are never treated as escapes
            m = re.findall(r'<td.*>(.*?)</td>', line)
            if m:
                # strip() again
                print(m)

This will only print:

['ALSO_EXTRACT_THIS']

instead of my desired

['EXTRACT_THIS', 'ALSO_EXTRACT_THIS']

Is there something I am doing wrong? From what I've read, this seems to be the usual way to extract multiple substrings between repeating delimiters.

Answer 1

This is because .* is greedy: the <td.*> part matches the longest string it can, which here is <td (some possible parameters)> EXTRACT_THIS </td> <td (some possible parameters)>. That leaves only the last cell for the capture group.
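To see what the greedy part swallows, here is a small sketch that wraps it in its own capture group. The sample line is the one from the question; the variable names are just for illustration.

```python
import re

line = ('<td (some possible parameters)> EXTRACT_THIS </td> '
        '<td (some possible parameters)> ALSO_EXTRACT_THIS </td>')

# Same pattern as in the question, but with the greedy prefix captured too
match = re.search(r'(<td.*>)(.*?)</td>', line)

greedy_prefix = match.group(1)  # everything consumed by <td.*>
captured = match.group(2)       # what is left for (.*?)

print(repr(greedy_prefix))
print(repr(captured))
```

The first group runs all the way to the opening tag of the last cell, so the first cell's text ends up inside the prefix rather than in the capture group.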

You should use a non-greedy quantifier for the <td ...> part too:

'<td.*?>(.*?)</td>'
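A minimal sketch of the corrected pattern on the line from the question (the list comprehension with strip() stands in for the "strip() again" step and is my assumption about what is wanted):

```python
import re

line = ('<td (some possible parameters)> EXTRACT_THIS </td> '
        '<td (some possible parameters)> ALSO_EXTRACT_THIS </td>')

# Both quantifiers are non-greedy, so each match stops at the nearest > and </td>
pattern = re.compile(r'<td.*?>(.*?)</td>')

# findall returns the captured group for every non-overlapping match
cells = [cell.strip() for cell in pattern.findall(line)]
print(cells)  # ['EXTRACT_THIS', 'ALSO_EXTRACT_THIS']
```

Note that this still works line by line, so a cell whose opening and closing tags sit on different lines would not be matched.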

solution1 0 ACCEPTED 2019-12-02 02:13:00