I know similar questions have been asked before, but since their answers didn't solve my problem, please bear with me while I go through the issue one more time.
Here's my string:
normal = """
<p>
<b>
<a href='link1'> Forget me </a>
</b> I need this one <br>
<b>
<a href='link2'> Forget me too </a>
</b> Forget me not <i>even when</i> you go to sleep <br>
<b> <a href='link3'> Forget me three </a>
</b> Foremost on your mind <br>
</p>
"""
I start with:
import lxml.html
from lxml import etree
target = lxml.html.fromstring(normal)
tree_struct = etree.ElementTree(target)
Now, I basically need to ignore everything anchored by the <a> tag. But if I run this code:
for e in target.iter():
    item = target.xpath(tree_struct.getpath(e))
    if len(item) > 0:
        print(item[0].text)
I get nothing; if, on the other hand, I change the print instruction to:
print(item[0].text_content())
I get this output:
Forget me
I need this one
Forget me too
Forget me not
even when
you go to sleep
Forget me three
Foremost on your mind
While my desired output is:
I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
Aside from giving the wrong output, it's also inelegant. So I must be missing something obvious, though I can't figure out what.
I think you are making this unnecessarily complicated. There is no need to create the tree_struct object and use getpath(). Here is a suggestion:
from lxml import html
normal = """
<p>
<b>
<a href='link1'> Forget me </a>
</b> I need this one <br>
<b>
<a href='link2'> Forget me too </a>
</b> Forget me not <i>even when</i> you go to sleep <br>
<b> <a href='link3'> Forget me three </a>
</b> Foremost on your mind <br>
</p>
"""
target = html.fromstring(normal)
for e in target.iter():
    if e.tag != "a":
        # Print text content if not only whitespace
        if e.text and e.text.strip():
            print(e.text.strip())
    # Print tail content if not only whitespace
    if e.tail and e.tail.strip():
        print(e.tail.strip())
Output:
I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
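As a possible alternative to walking the tree by hand, the same filtering can be sketched with a single XPath expression. This is only a sketch, not part of the suggestion above: the helper name text_outside_links is made up for illustration, and it assumes that "ignore everything anchored by the <a> tag" means dropping every text node that has an <a> ancestor.

```python
from lxml import html

normal = """
<p>
<b>
<a href='link1'> Forget me </a>
</b> I need this one <br>
<b>
<a href='link2'> Forget me too </a>
</b> Forget me not <i>even when</i> you go to sleep <br>
<b> <a href='link3'> Forget me three </a>
</b> Foremost on your mind <br>
</p>
"""


def text_outside_links(markup):
    """Return stripped, non-empty text nodes that are not inside an <a> element.

    Hypothetical helper, named here only for illustration.
    """
    root = html.fromstring(markup)
    # text() matches what lxml exposes as both .text and .tail;
    # the predicate discards text nodes that have an <a> ancestor.
    nodes = root.xpath(".//text()[not(ancestor::a)]")
    return [t.strip() for t in nodes if t.strip()]


for line in text_outside_links(normal):
    print(line)
```

The tail of an <a> element belongs to its parent in XPath terms, so text that follows a link is kept and only the link text itself is filtered out.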