Python Regex for html tags

Question

I'm trying to get rid of some elements of the HTML code before using an HTML parser. I'm pretty new to regex, which is why I have trouble understanding the syntax.

Parts of my HTML code look like this:

<div class="footer" id="footer">
 <other tags> ... bla ... </other tags>
</div>

But it appears that the same part of the page can be written differently on certain sub-pages, like this:

<div id="footer" class="footer">
 <other tags> ... bla ... </other tags>
</div>

What I have achieved so far is getting rid of one specific case:

footer = re.sub('<div class="footer" id="footer">.*?</div>','',html)

But what I want is a more general regex, one that gets rid of every such block whenever, e.g., id="footer" appears in the opening tag, no matter what is in front of or behind it:

<div ... id="footer" ...> 
<other tags> ... bla ... </other tags>    
</div>

EDIT: before getting "hated": I'm pretty new to HTML parsers too.

Thanks for the help!

MG

Answer 1

Why would you want to remove it? As Bhavesh said, you can just select the elements you want. But if you want to know whether they can be removed: yes, you can get rid of them with decompose():

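from bs4 import BeautifulSoup

a = """
<div class="footer" id="footer">
 <p>lskjdf</p>
</div>

<div id="not_footer" class="footer">
<p>lskjdf</p>
</div>
"""
# The <html><body> wrapper in the output below comes from the lxml parser.
b = BeautifulSoup(a, "lxml")
print(b)
print('---------------------')
print('---------------------')
# Remove every <div> whose id is "footer", whatever the attribute order.
for c in b.select('div#footer'):
    c.decompose()
print(b)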

Output:

<html><body><div class="footer" id="footer">
<p>lskjdf</p>
</div>
<div class="footer" id="not_footer">
<p>lskjdf</p>
</div>
</body></html>
---------------------
---------------------
<html><body>
<div class="footer" id="not_footer">
<p>lskjdf</p>
</div>
</body></html>
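
If you still want to see what the general regex from the question could look like, here is a minimal sketch. It is not part of the code above. The names FOOTER_DIV and strip_footer are made up for illustration, and it assumes the id is written with double quotes and that the footer div contains no nested <div>. The pattern lets any attributes appear before or after id="footer" and uses re.DOTALL so that "." also matches newlines.

```python
import re

# <div ...> whose attribute list contains id="footer" in any position,
# followed by everything up to the FIRST closing </div> (non-greedy).
FOOTER_DIV = re.compile(
    r'<div\b[^>]*\sid="footer"[^>]*>.*?</div>',
    flags=re.DOTALL | re.IGNORECASE,
)

def strip_footer(html):
    """Return html with every matching footer div removed."""
    return FOOTER_DIV.sub("", html)

page_a = '<div class="footer" id="footer">\n <p>bla</p>\n</div>\n<p>keep</p>'
page_b = '<div id="footer" class="footer">\n <p>bla</p>\n</div>\n<p>keep</p>'

print(strip_footer(page_a))
print(strip_footer(page_b))
```

Because the match stops at the first </div>, a footer that contains another <div> would only be removed partially. That is the main reason the parser approach with decompose() is usually the safer choice.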

1 answer

solution1
1 ACCEPTED 2017-01-03 13:26:28
