I am trying to parse the elements of an HTML list that looks like this:
<ol>
<li>r1</li>
<li>r2
<ul>
<li>n1</li>
<li>n2</li>
</ul>
</li>
<li>r3
<ul>
<li>d1
<ol>
<li>e1</li>
<li>e2</li>
</ol>
</li>
<li>d2</li>
</ul>
</li>
<li>r4</li>
</ol>
I am fine with parsing this for the most part, but my biggest problem is getting the text of an individual element back. lxml's node.text_content() returns the text of the complete subtree under the node. Can I obtain the text content of just that element using lxml, or would I need string manipulation or a regex for that?
For example, the node containing d1 returns "d1e1e2", whereas I want it to return just d1.
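Each node has an attribute called text, which holds only the text directly inside the element, up to its first child. That's what you are looking for. Any text that follows a child element is stored in that child's tail attribute instead.
For example:
for node in root.iter("*"):
    print(node.text)
    # print(node.tail)  # e.g. in <div> <span> abc </span> def </div>, span.text is " abc " and span.tail is " def "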
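To make this concrete, here is a minimal sketch that applies the text and tail attributes to the list from the question. The helper name own_text and the SNIPPET constant are my own, not part of lxml. The import falls back to the standard library's ElementTree, which exposes the same text/tail attributes, in case lxml is not installed. It also assumes the markup is well-formed enough to parse as XML; for real-world HTML you would likely parse with lxml.html instead.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the stdlib ElementTree is an acceptable stand-in here,
    # since it offers the same .text / .tail attributes.
    import xml.etree.ElementTree as etree

# The list from the question, kept verbatim.
SNIPPET = """<ol>
<li>r1</li>
<li>r2
<ul>
<li>n1</li>
<li>n2</li>
</ul>
</li>
<li>r3
<ul>
<li>d1
<ol>
<li>e1</li>
<li>e2</li>
</ol>
</li>
<li>d2</li>
</ul>
</li>
<li>r4</li>
</ol>"""


def own_text(node):
    """Return only the text that belongs directly to `node` (hypothetical helper).

    node.text is the text before the first child; each child's tail is the
    text between that child and the next one, which still belongs to `node`.
    """
    parts = [node.text or ""]
    for child in node:
        parts.append(child.tail or "")
    return "".join(parts).strip()


root = etree.fromstring(SNIPPET)
labels = [own_text(li) for li in root.iter("li")]
print(labels)
```

Note that node.text alone would give "d1" followed by a newline for the d1 item, because the whitespace before the nested list is part of the text, so stripping is usually needed. Collecting the children's tails as well covers the case where an element has text after a nested tag.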