
How to find all occurrences when a prefix is present

I am looking for repeating patterns inside an HTML page.
The patterns I am interested in start after the prefix "<h2>Seasons</h2>"
The same patterns occur before the prefix too, I am not interested in those.

I tried (and failed) with the following Python code (I simplified the pattern to '<a href=.+?</a>' to keep this question readable):

import re

matches = re.compile('<h2>Seasons</h2>.+?(<a href=.+?</a>)+', re.DOTALL).findall(page)
for ref in matches:
    print(ref)

Given the page:

blah blah html stuff 
<h2>Seasons</h2>  
blah blah  more html stuff
<a href=http://www.111.com>111</a><a href=http://www.222.com>222</a><a href=http://www.333.com>333</a>

The output is

<a href=http://www.333.com>333</a>  

So it only prints the last match; the other two never make it into the list returned by findall. How do I iterate over all matches of the group?

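The problem is that the regex as a whole matches only once. The parenthesized group is repeated by the + and matches three times inside that single match, but a repeated group keeps only the text of its last repetition. Because the pattern contains exactly one group, findall returns that group's text for each match, so you get a single result: the last link.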
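To make this concrete, here is a small sketch that runs the question's pattern against the sample page and compares the whole match with what the group retains. The variable names (whole_matches, group_values) are only illustrative.

```python
import re

# The sample page from the question.
page = (
    "blah blah html stuff\n"
    "<h2>Seasons</h2>\n"
    "blah blah  more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

pattern = re.compile(r'<h2>Seasons</h2>.+?(<a href=.+?</a>)+', re.DOTALL)

# group(0) is the text of the entire match; there is only one such match.
whole_matches = [m.group(0) for m in pattern.finditer(page)]

# With one group in the pattern, findall returns the group's text per match,
# and a repeated group holds only its final repetition.
group_values = pattern.findall(page)

print(len(whole_matches))  # 1
print(group_values)        # ['<a href=http://www.333.com>333</a>']
```

The single whole match does span all three links; it is only the group that forgets the first two.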

To get around this you need a regex that matches once per link rather than once overall. You might think of using a lookbehind assertion for the <h2> element, like so:

(?<=<h2>Seasons</h2>.+?)(<a href=.+?</a>)    # doesn't work

This says to find <a> elements, but only if they're preceded by <h2>Seasons</h2>. Unfortunately, in Python's re module a lookbehind pattern must match strings of a fixed length, so you can't put .+? inside one. That approach is out.
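A quick way to confirm the restriction is to try compiling the pattern with the standard re module. This is only a sketch; the variable name lookbehind_error is made up for the example.

```python
import re

try:
    # The .+? inside (?<=...) makes the lookbehind variable-width.
    re.compile(r'(?<=<h2>Seasons</h2>.+?)(<a href=.+?</a>)', re.DOTALL)
    lookbehind_error = None
except re.error as exc:
    # re refuses to compile a lookbehind whose width is not fixed.
    lookbehind_error = exc

print(lookbehind_error)
```

The pattern is rejected at compile time, before any page text is ever searched.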

The next option is to find the location of the <h2> element first, then run the regex search starting from there.

>>> re.findall('<a href=.+?</a>', page[page.find('<h2>Seasons</h2>'):], re.DOTALL)
['<a href=http://www.111.com>111</a>', '<a href=http://www.222.com>222</a>', '<a href=http://www.333.com>333</a>']
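One caveat with the one-liner: str.find returns -1 when the prefix is absent, so the slice page[-1:] silently becomes the last character of the page. A slightly more defensive sketch checks for that case and passes a start position to the compiled pattern's findall instead of slicing. The function name links_after and the sample string are assumptions for illustration, not part of the original answer.

```python
import re

LINK_RE = re.compile(r'<a href=.+?</a>', re.DOTALL)

def links_after(page, prefix='<h2>Seasons</h2>'):
    """Return every link that appears after the first occurrence of prefix."""
    start = page.find(prefix)
    if start == -1:
        # Prefix not present: nothing qualifies.
        return []
    # Compiled patterns accept a start position, so no copy of the page is needed.
    return LINK_RE.findall(page, start + len(prefix))

# Hypothetical sample: one link before the heading, three after it.
sample = (
    "<a href=http://www.000.com>000</a>\n"
    "<h2>Seasons</h2>\n"
    "blah blah  more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

print(links_after(sample))
print(links_after("no heading here <a href=http://www.444.com>444</a>"))
```

The first call lists only the three links after the heading; the second returns an empty list because the heading is missing.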

You should use an HTML parser like BeautifulSoup; it will make your life a lot easier.
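As a rough sketch of the parser approach, here is the same extraction done with html.parser from Python's standard library rather than BeautifulSoup, a swap made only so the example runs without installing anything. The class name SeasonLinks and the extra 000 link placed before the heading are made up for illustration.

```python
from html.parser import HTMLParser

class SeasonLinks(HTMLParser):
    """Collect href values of <a> tags that come after <h2>Seasons</h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False          # currently inside an <h2> element
        self.after_heading = False  # the Seasons heading has been seen
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.after_heading:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Flip the switch once the heading text itself is encountered.
        if self.in_h2 and data.strip() == "Seasons":
            self.after_heading = True

# Hypothetical page: one link before the heading, three after it.
page = (
    "<a href=http://www.000.com>000</a>\n"
    "<h2>Seasons</h2>\n"
    "blah blah  more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

parser = SeasonLinks()
parser.feed(page)
parser.close()
print(parser.links)
```

Unlike the regex versions, this yields the href values themselves and does not depend on how the attributes are quoted or ordered inside the tag.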
