Python re: multiple matching groups

I have a string:

s = '&nbsp;<span>Mil<\/span><\/th><td align=\"right\" headers=\"Y0 i7\">112<\/td><td align=\"right\" headers=\"Y1 i7\">113<\/td><td align=\"right\" headers=\"Y2 i7\">110<\/td><td align=\"right\" headers=\"Y3 i7\">107<\/td><td align=\"right\" headers=\"Y4 i7\">105<\/td><td align=\"right\" headers=\"Y5 i7\">95<\/td><td align=\"right\" headers=\"Y6 i7\">95<\/td><td align=\"right\" headers=\"Y7 i7\">87<\/td><td align=\"right\" headers=\"Y8 i7\">77<\/td><td align=\"right\" headers=\"Y9 i7\">74<\/td><td align=\"right\" headers=\"Y10 i7\">74<\/td><\/tr>'

I want to extract these numbers from the string:

112 113 110 107 105 95 95 87 77 74 74

I am no expert on regular expressions, so can anyone tell me why this isn't returning any matches:

import re

p = re.compile(r'&nbsp;.*(>\d*<\\/td>.*)*<\\/tr>')
m = p.match(s)

I'm sure there is an html/xml parsing module that can solve my problem and I could also just split the string and work on that output, but I really want to do it with the re module. Thanks!

>>> r = re.compile(r'headers="Y\d+ i\d+">(\d+)<\\/td>')
>>> r.findall(s)
['112', '113', '110', '107', '105', '95', '95', '87', '77', '74', '74']

All of the numbers you want are in between ">" and "<". So, you can just do this:

re.findall(r">(\d+)<", s)

output:

['112', '113', '110', '107', '105', '95', '95', '87', '77', '74', '74']

Basically, the pattern says: capture every run of digits that sits between ">" and "<". If you only want the unique values, pass the result to set (keeping in mind that a set does not preserve the original order).
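As a rough sketch of that deduplication step: the string below is a shortened stand-in for the one in the question, and the variable names are illustrative assumptions rather than anything from the original post. It shows both set and an order-preserving alternative built on dict.fromkeys.

```python
import re

# Shortened stand-in for the question's string (assumption: same shape, fewer cells).
s = r'&nbsp;<span>Mil<\/span><\/th><td align="right" headers="Y0 i7">112<\/td><td align="right" headers="Y1 i7">95<\/td><td align="right" headers="Y2 i7">95<\/td><td align="right" headers="Y3 i7">74<\/td><\/tr>'

# Every run of digits that sits directly between ">" and "<".
numbers = re.findall(r">(\d+)<", s)

# set() drops duplicates but does not keep the original order.
unique_unordered = set(numbers)

# dict.fromkeys() drops duplicates and keeps first-seen order.
unique_in_order = list(dict.fromkeys(numbers))

# findall returns strings; convert if you need real integers.
as_ints = [int(n) for n in numbers]

print(numbers)          # ['112', '95', '95', '74']
print(unique_in_order)  # ['112', '95', '74']
print(as_ints)          # [112, 95, 95, 74]
```

Whether set or the order-preserving version is the better fit depends on whether the column order of the table matters to you.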

The other answers give regexes that will work, but it's worth understanding why your regex doesn't.

Every quantifier in your regex is both greedy and optional (*). So your regex says:

  • &nbsp;
  • 0 or more characters of anything
  • 0 or more occurrences of your capture group
  • <\/tr>

"0 or more characters of anything" is greedy, so it swallows everything up to the final <\/tr> and leaves nothing for the capture group. Because the group is optional, the overall match still succeeds, just with nothing captured.

If you wanted to redesign your regex to work, you would want to use .*? instead of .* to match the junk at the beginning of the string. The ? makes the match nongreedy, so that it will match as few characters as possible rather than as many as possible.
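The behaviour described in this answer can be checked with a small experiment. This is only a sketch: the string is a shortened stand-in for the one in the question, and the names original, greedy, lazy and repeated are made up for illustration. It also shows one caveat worth knowing: a capture group that repeats keeps only its last iteration, so even a non-greedy redesign of a single pattern will not hand back every number, which is why findall tends to be the simpler tool here.

```python
import re

# Shortened stand-in for the question's string (assumption: same shape, fewer cells).
s = r'&nbsp;<span>Mil<\/span><\/th><td align="right" headers="Y0 i7">112<\/td><td align="right" headers="Y1 i7">113<\/td><\/tr>'

# 1. The original pattern does match, but the optional group is skipped entirely.
original = re.match(r'&nbsp;.*(>\d*<\\/td>.*)*<\\/tr>', s)
print(original is not None)   # True
print(original.group(1))      # None

# 2. Greedy vs. non-greedy junk in front of a single capture group.
greedy = re.match(r'&nbsp;.*>(\d+)<\\/td>', s)
lazy = re.match(r'&nbsp;.*?>(\d+)<\\/td>', s)
print(greedy.group(1))        # 113 (greedy .* runs to the last cell)
print(lazy.group(1))          # 112 (lazy .*? stops at the first cell)

# 3. A repeated group only remembers its final iteration...
repeated = re.match(r'(?:(\d+),)*', '1,2,3,')
print(repeated.group(1))      # 3

# ...so findall is the simplest way to collect every cell.
print(re.findall(r'>(\d+)<\\/td>', s))   # ['112', '113']
```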

Your expression isn't returning the matches you expect because it is written slightly wrong. Instead of:

p = re.compile(r'&nbsp;.*(>\d*<\\/td>.*)*<\\/tr>')
m = p.match(s)

you probably want this:

>>> p = re.compile(r'headers="Y\d+ i\d+">(\d+)<\\/td>')
>>> p.findall(s)
['112', '113', '110', '107', '105', '95', '95', '87', '77', '74', '74'] 
