Beautiful Soup / regex matching over multiple lines

I have an RSS indexing app, written in Python, that stores the RSS content as a blurb in the DB. When the app initially processed the article contents, it commented out all links that didn't match certain criteria. For example:

<a href="http://google.com">Google</a>

Became:

<!--<a href="http://google.com">Google</a>--> Google

Now I need to process all these old articles and modify the links. Using BeautifulSoup 4, I can easily find the comments with:

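import re
from bs4 import Comment

# soup is the BeautifulSoup object built from one stored article
links = soup.findAll(text=lambda text: isinstance(text, Comment))
for link in links:
    # strip the tags from the commented-out markup, leaving only the link text
    text = re.sub(r'<[^>]*>', '', link.string)
    # any html in the link tag was escaped by BS4, so need to convert back
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    find = link.string + " " + text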

The output of "find" above is:

<!--<a href="http://google.com">Google</a>--> Google

Which makes it easier to perform a .replace() on the content.

Now the problem I'm having (and I'm sure this is simple) is multi-line find/replacing. When Beautiful Soup initially commented out the links, some were converted to:

<!--<a href="http://google.com">Google
</a>--> Google

or

<!--<a href="http://google.com">Google</a>--> 
Google

So a plain replace(old, new) won't work, because once line breaks are involved the stored text no longer matches the string I build.

Can someone help me out with a regex multi-line find/replace? It should be case-sensitive.

Try this:

 re.sub(r'pattern', '', link, flags=re.MULTILINE)

Regex matching is case-sensitive by default.
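One caveat on the flag: re.MULTILINE only changes where ^ and $ match. To let . cross line breaks you need re.DOTALL, while \s matches newlines regardless of flags. Here is a minimal sketch of a whitespace-tolerant replacement, assuming the stored text always has the shape "commented-out link, then a plain copy of the link text". The names COMMENTED_LINK and restore_links are made up for illustration, and the pattern is a guess at the data, not something taken from the question.

```python
import re

# Hypothetical pattern (an assumption about the stored data):
#   group 1 = the opening <a ...> tag
#   group 2 = the link text, with surrounding whitespace trimmed
# \s* absorbs any spaces or newlines around the tags and the comment
# delimiters; \2 requires the same link text to follow the comment.
COMMENTED_LINK = re.compile(
    r'<!--\s*(<a\b[^>]*>)\s*(.*?)\s*</a>\s*-->\s*\2',
    re.DOTALL,  # lets "." in the link text span line breaks
)


def restore_links(html):
    """Replace each commented-out link and its duplicate text with the bare link."""
    return COMMENTED_LINK.sub(r'\1\2</a>', html)


if __name__ == "__main__":
    samples = [
        '<!--<a href="http://google.com">Google</a>--> Google',
        '<!--<a href="http://google.com">Google\n</a>--> Google',
        '<!--<a href="http://google.com">Google</a>--> \nGoogle',
    ]
    for sample in samples:
        print(repr(restore_links(sample)))
```

The backreference is compared literally, so the match stays case-sensitive, and link text that itself contains markup or entities would need extra handling.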

If for some reason the RSS content becomes irregular, your script will fail. In that case you should consider using a proper parser, for instance lxml.
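To illustrate the parser route without extra dependencies, the sketch below swaps in the standard library's html.parser instead of lxml; this is a substitution for the sake of a self-contained example, not the lxml API. The class CommentCollector and the helper collect_comments are hypothetical names. The point is that a parser hands you each comment body whole, wherever the line breaks fall.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the body of every HTML comment fed to the parser."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between <!-- and -->, newlines included
        self.comments.append(data)


def collect_comments(html):
    """Return the comment bodies found in html, in document order."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments


if __name__ == "__main__":
    article = '<p><!--<a href="http://google.com">Google\n</a>--> Google</p>'
    for body in collect_comments(article):
        print(repr(body))
```

Each collected body can then be rewritten on its own, which avoids having to match the comment delimiters and the surrounding whitespace by hand.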
