I am looking for repeating patterns inside an HTML page.
The patterns I am interested in start after the prefix "<h2>Seasons</h2>"
The same patterns occur before the prefix too, I am not interested in those.
I tried (and failed) with the following python code (I simplified the pattern to '<a href=.+?</a>' for the sake of making this question readable):
matches = re.compile('<h2>Seasons</h2>.+?(<a href=.+?</a>)+',re.DOTALL).findall(page)
for ref in matches
print ref
Given the page:
blah blah html stuff
<h2>Seasons</h2>
blah blah more html stuff
<a href=http://www.111.com>111</a><a href=http://www.222.com>222</a><a href=http://www.333.com>333</a>
The output is
<a href=http://www.333.com>333</a>
So it only prints the last match, the other two do not make it to the findall list. How do I do to iterate over all matches of the groups?
The problem is that the regex matches only a single time. The parenthesized group matches multiple times, but the regex as a whole only matches once. This means only one match is returned, the last one.
To get around this you need to write a regex that matches multiple times. You might think to use a lookbehind assertion for the <h2>
element like so:
(?<=<h2>Seasons</h2>.+?)(<a href=.+?</a>) # doesn't work
This says to find <a>
elements, but only if they're preceded by <h2>Seasons</h2>
. Unfortunately lookbehind strings have to be of fixed length. You can't put .+?
in a lookbehind assertion. So that approach is out.
Next up is to find the location of the <h2>
element first, then perform the regex search starting from there.
>>> re.findall('<a href=.+?</a>', page[page.find('<h2>Seasons</h2>'):], re.DOTALL)
['<a href=http://www.111.com>111</a>', '<a href=http://www.222.com>222</a>', '<a href=http://www.333.com>333</a>']
You should use an html parser like BeautifulSoup ; will make your life a lot easier.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.