Suppose I have an HTML snippet like this:
<div>
Hello <strong>There</strong>
<div>I think <em>I am</em> feeling better!</div>
<div>Don't you?</div>
Yup!
</div>
What's the best/most robust way to remove the surrounding root element, so it looks like this:
Hello <strong>There</strong>
<div>I think <em>I am</em> feeling better!</div>
<div>Don't you?</div>
Yup!
I've tried using lxml.html like this:
lxml.html.fromstring(fragment_string).drop_tag()
But that only gives me "Hello", which I guess makes sense. Any better ideas?
This is a bit odd in lxml (or ElementTree). You'd have to do:
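from lxml.html import tostring
def inner_html(el):
    # el.text is the text before the first child; each serialized child includes its tail text
    return (el.text or '') + ''.join(tostring(child, encoding='unicode') for child in el)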
Note that lxml (and ElementTree) have no special way to represent a document except rooted with a single element, but .drop_tag() would work like you want if that <div> wasn't the root element.
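Since the same approach applies to ElementTree, here is a minimal sketch using only the standard library's xml.etree.ElementTree. It assumes the fragment is well-formed XML, which real-world HTML often is not (lxml.html is more forgiving there). The name inner_xml is just an illustrative helper, not part of either library.

```python
import xml.etree.ElementTree as ET

fragment = """<div>
Hello <strong>There</strong>
<div>I think <em>I am</em> feeling better!</div>
<div>Don't you?</div>
Yup!
</div>"""

def inner_xml(el):
    # el.text holds the text before the first child element.
    # ET.tostring() serializes each child together with its tail text,
    # so the text between and after the children is preserved.
    return (el.text or '') + ''.join(
        ET.tostring(child, encoding='unicode') for child in el
    )

root = ET.fromstring(fragment)
result = inner_xml(root)
print(result)
```

The result keeps the leading and trailing newlines that sat inside the root element; call result.strip() if you don't want them.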
You can use the BeautifulSoup package. For this particular HTML I would do it like this:
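from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)
html = """<div>
Hello <strong>There</strong>
<div>I think <em>I am</em> feeling better!</div>
<div>Don't you?</div>
Yup!
</div>"""
bs = BeautifulSoup(html, 'html.parser')
# Serialize every child node of the root <div>; the text nodes already carry the newlines
no_root = ''.join(map(str, bs.div.contents))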
BeautifulSoup has many nice features that will allow you to tweak this example for many other cases. Full documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html
For such a simple task you can use a regexp like r'<(.*?)>(.*)</\1>' and take match group 2 (\2 in Perl terms) from it. You should also set flags like ms so it works correctly across multiple lines (in Python, re.DOTALL is the one that lets . match newlines).
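A minimal sketch of that regexp in Python, assuming the root tag has no attributes (with attributes, the \1 backreference would no longer match the closing tag) and that the fragment starts with the root element. The variable names are illustrative only.

```python
import re

fragment = """<div>
Hello <strong>There</strong>
<div>I think <em>I am</em> feeling better!</div>
<div>Don't you?</div>
Yup!
</div>"""

# Group 1 captures the root tag name, group 2 captures everything inside it.
# The greedy (.*) makes the closing </\1> match the LAST closing tag,
# and re.DOTALL lets . match newlines.
pattern = re.compile(r'<(.*?)>(.*)</\1>', re.DOTALL)

match = pattern.match(fragment)
# Fall back to the unchanged input if the fragment has no matching root pair.
inner = match.group(2) if match else fragment
print(inner)
```

This is fine for a known, simple snippet, but a real parser is the safer choice for arbitrary HTML.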