
How do I use lxml and Python to traverse the <body> of an HTML document along with its children?

I would like to take an HTML document and traverse only the <body> part of the document, together with its children. I see lots of examples that get a subtree via XPath or tag name, but these don't seem to give me the children.

import lxml
from lxml import html, etree  

html3 = "<html><head><title>test<body><h1>page title</h3><p>some text</p>"
root = lxml.html.fromstring(html3)
tree = etree.ElementTree(root)
for el in root.iter():
    # do something
    print(el.text, tree.getpath(el))

This outputs:

None /html
None /html/head
test /html/head/title
None /html/body
page title /html/body/h1
some text /html/body/p

I would like only:

page title /html/body/h1
some text /html/body/p

Any help gratefully received.

I had a similar difficulty, then figured out that every etree node provides its own iterator, so once you have a node you can traverse everything beneath it.

For instance, root here is the <body> element, and iterating over it visits each element inside the body:

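from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse('yourdocument.html', parser)

# No trailing slash: '/html/body/' is not a valid XPath expression.
root = tree.xpath('/html/body')[0]
# iter() replaces the deprecated getiterator(). It yields <body> itself first;
# use root.iterdescendants() to visit only the elements inside it.
for el in root.iter():
    print(el.tag, el.text)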
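
To get exactly the two lines requested in the question, one option is to skip <body> itself and visit only its descendants. The following is a minimal sketch, not a definitive implementation: it assumes well-formed sample markup (the question's snippet is malformed), and body_descendants is a hypothetical helper name introduced for illustration.

```python
from lxml import etree, html

# Well-formed sample markup (an assumption; the question's string is malformed).
SAMPLE = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1><p>some text</p></body></html>"
)


def body_descendants(markup):
    """Return (text, path) pairs for every element below <body>."""
    root = html.fromstring(markup)      # root is the <html> element
    tree = etree.ElementTree(root)      # needed for getpath()
    body = root.find("body")
    # iterdescendants() skips <body> itself, unlike iter().
    return [(el.text, tree.getpath(el)) for el in body.iterdescendants()]


for text, path in body_descendants(SAMPLE):
    print(text, path)
```

Starting the iteration at the body element, rather than at the document root, is what removes the /html, /html/head and /html/head/title lines from the output.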

It seems that your HTML is malformed. I wrote a small program with BeautifulSoup that you may be able to adapt for your purpose:

from bs4 import BeautifulSoup
html3 = "<html><head><title>test</title></head><body><h1>page title</h1><p>some text</p></body></html>"
soup = BeautifulSoup(html3, "html5lib")
body = soup.find('body')

for item in body.findChildren():
    print(item)

Output

<h1>page title</h1>
<p>some text</p>
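
Note that findChildren() returns every descendant, not only the direct children, which matters once the body contains nested elements. For comparison, the same distinction in lxml can be sketched as follows; the NESTED markup and the variable names are assumptions made up for illustration.

```python
from lxml import html

# Hypothetical markup with one level of nesting inside <body>.
NESTED = "<html><body><div><p>inner</p></div><p>outer</p></body></html>"

body = html.fromstring(NESTED).find("body")

# Iterating an element directly yields only its direct children.
children = [el.tag for el in body]

# iterdescendants() walks the whole subtree in document order.
descendants = [el.tag for el in body.iterdescendants()]

print(children)
print(descendants)
```

Here the inner <p> appears only in the second list, because it is a grandchild of <body> rather than a child.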

