Python Regex for html tags

I'm trying to strip some elements out of the HTML before passing it to an HTML parser. I'm pretty new to regex, which is why I have trouble understanding the syntax.

Parts of my html-code look like this:

<div class="footer" id="footer">
 <other tags> ... bla ... </other tags>
</div>

But it appears that the same "part" of the page can be written differently on a certain sub-page, like this:

<div id="footer" class="footer">
 <other tags> ... bla ... </other tags>
</div>

So far I have only managed to remove one specific case:

footer = re.sub('<div class="footer" id="footer">.*?</div>','',html)

But what I want is a more general regex, one that removes every such block whenever the opening tag contains id="footer", no matter what comes before or after it:

<div ... id="footer" ...> 
<other tags> ... bla ... </other tags>    
</div> 
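
For illustration, the kind of pattern described here could be sketched as follows. This is only a minimal sketch, not a recommended solution: the sample string and the name FOOTER_RE are made up for the example, and it assumes the id is written exactly as id="footer" with double quotes.

```python
import re

# Hypothetical sample input; the two footers differ only in attribute order.
html = '''<div class="footer" id="footer">
 <p>bla</p>
</div>
<p>keep me</p>
<div id="footer" class="footer">
 <p>bla</p>
</div>'''

# [^>]* skips any attributes before and after id="footer";
# \s requires whitespace before "id" so that e.g. data-id="footer" is ignored;
# re.DOTALL lets .*? span the line breaks inside the div.
FOOTER_RE = re.compile(r'<div\b[^>]*\sid="footer"[^>]*>.*?</div>', re.DOTALL)

cleaned = FOOTER_RE.sub('', html)
print(cleaned)

# Limitation: .*? stops at the FIRST </div>, so a nested <div> breaks the match.
nested = '<div id="footer"><div>inner</div>tail</div>'
print(FOOTER_RE.sub('', nested))  # leaves 'tail</div>' behind
```

The nested case is the main reason a regex is fragile for this job: it cannot keep track of which closing tag belongs to which opening tag.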

EDIT: Before I get "hated": I'm pretty new to HTML parsers too.

Thanks for the help!

MG

Why would you want to remove it? As Bhavesh said, just select the elements you want. But if you want to know whether they can be removed, then yes: you can get rid of them with decompose().

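from bs4 import BeautifulSoup

a = """
<div class="footer" id="footer">
 <p>lskjdf</p>
</div>

<div id="not_footer" class="footer">
<p>lskjdf</p>
</div>
"""
# Assumption: the lxml parser (pip install lxml), which wraps the fragment in <html><body>
b = BeautifulSoup(a, "lxml")
print(b)
print('---------------------')
print('---------------------')
# select() takes a CSS selector; div#footer matches <div> elements with id="footer"
for c in b.select('div#footer'):
    c.decompose()  # remove the tag and everything inside it from the tree
print(b)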

Output:

<html><body><div class="footer" id="footer">
<p>lskjdf</p>
</div>
<div class="footer" id="not_footer">
<p>lskjdf</p>
</div>
</body></html>
---------------------
---------------------
<html><body>
<div class="footer" id="not_footer">
<p>lskjdf</p>
</div>
</body></html>
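
The same idea can be wrapped in a small helper to show that attribute order and nested tags do not matter to the parser. This is a rough sketch under a few assumptions: remove_by_id, page_a and page_b are names invented for the example, and it uses the built-in "html.parser" backend, which (unlike the output above) does not add <html><body> around a fragment.

```python
from bs4 import BeautifulSoup


def remove_by_id(html, element_id):
    """Return html with every element whose id equals element_id removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(id=...) matches on the parsed attribute value,
    # so the order of attributes in the source is irrelevant.
    for tag in soup.find_all(id=element_id):
        tag.decompose()  # drops the tag together with all of its children
    return str(soup)


# Hypothetical pages: same footer, attributes in a different order, nested <div> inside.
page_a = '<div class="footer" id="footer"><div>nested</div></div><p>body</p>'
page_b = '<div id="footer" class="footer"><div>nested</div></div><p>body</p>'

print(remove_by_id(page_a, "footer"))
print(remove_by_id(page_b, "footer"))
```

Both calls should leave only the <p>body</p> element, since the parser removes the whole footer subtree regardless of how the opening tag is written.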
