
Repeating delimiters and extracting the string between those

I am fairly new to Python and to regular expressions, and I am trying to extract information from an HTML file.

Assume the following line appears in the HTML file (since HTML ignores whitespace between tags, the example is all on one line):

<td (some possible parameters)> EXTRACT_THIS </td> <td (some possible parameters)> ALSO_EXTRACT_THIS </td>

In my current code:

import re

with open(myInput, 'r') as inputFile:
    for line in inputFile:
        line = line.strip()

        if line != '':

            m = re.findall('<td.*>(.*?)</td>', line)
            if m:
                #strip() again
                print(m)

This will only print:

['ALSO_EXTRACT_THIS']

instead of my desired

['EXTRACT_THIS', 'ALSO_EXTRACT_THIS']

Is there something I am doing wrong? From what I have read, this seems to be a way to extract multiple substrings between repeating delimiters.

This happens because the .* in <td.*> is greedy: it matches the longest string it can, which here is <td (some possible parameters)> EXTRACT_THIS </td> <td (some possible parameters)> . Only the second cell is left for the capture group.

You should use a non-greedy quantifier for the <td> part too:

'<td.*?>(.*?)</td>'
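
To see the difference, here is a minimal sketch that runs both patterns on one line. The sample line and its class attributes are made up for illustration, standing in for the "(some possible parameters)" in the question.

```python
import re

# Sample line modelled on the question; the attributes are hypothetical.
line = '<td class="a"> EXTRACT_THIS </td> <td class="b"> ALSO_EXTRACT_THIS </td>'

# Greedy: '<td.*>' extends to the last '>' that still lets the rest of the
# pattern match, so it swallows the whole first cell.
greedy = re.findall(r'<td.*>(.*?)</td>', line)

# Non-greedy: '<td.*?>' stops at the first '>', so each cell matches separately.
lazy = re.findall(r'<td.*?>(.*?)</td>', line)

# Strip the surrounding whitespace from each captured cell.
cells = [cell.strip() for cell in lazy]

print(greedy)  # [' ALSO_EXTRACT_THIS ']
print(cells)   # ['EXTRACT_THIS', 'ALSO_EXTRACT_THIS']
```

Writing the opening tag as <td[^>]*> should behave the same way on input like this, since [^>]* cannot run past the tag's closing '>'.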
