How do I use lxml and Python to traverse the <body> of an HTML document along with its children?

Question

I would like to take an html document and traverse the <body> part of the document with its children. I see lots of examples to get a subtree via xpath or tag name but this doesn't seem to give the children.

import lxml
from lxml import html, etree  

html3 = "<html><head><title>test<body><h1>page title</h3><p>some text</p>"
root = lxml.html.fromstring(html3)
tree = etree.ElementTree(root)
for el in root.iter():
    # do something
    print(el.text, tree.getpath(el))

This will output

None /html
None /html/head
test /html/head/title
None /html/body
page title /html/body/h1
some text /html/body/p

I would like only

page title /html/body/h1
some text /html/body/p

Any help gratefully received.

Answer 1

I had a similar difficulty, then figured out that each etree node has its own iterator, which you can use to traverse the subtree below it.

For instance, root below is the <body> element, and from it you can iterate over each element inside the body:

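from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse('yourdocument.html', parser)

# No trailing slash: '/html/body/' is not a valid XPath expression
root = tree.xpath('/html/body')[0]
# iter() replaces the deprecated getiterator(); it yields <body> itself first,
# then every descendant (use root.iterdescendants() to skip <body>)
for i in root.iter():
    print(i.tag, i.text)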
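To get only the elements below <body>, as the question asks, one option is to iterate over the body's descendants rather than the whole tree. The following is a minimal sketch, not the answerer's code: it assumes a well-formed version of the question's HTML string (the name html_doc and the list results are illustrative), and it reuses tree.getpath() from the question to print each element's path.

```python
from lxml import etree, html

# Well-formed variant of the question's document (an assumption for this sketch)
html_doc = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1><p>some text</p></body></html>"
)

root = html.fromstring(html_doc)   # <html> element for a full document
tree = etree.ElementTree(root)     # needed for getpath()
body = root.find("body")

# iterdescendants() walks everything below <body> but skips <body> itself
results = [(el.text, tree.getpath(el)) for el in body.iterdescendants()]

for text, path in results:
    print(text, path)
# page title /html/body/h1
# some text /html/body/p
```

If only the direct children are wanted and nested elements should be skipped, body.iterchildren() can be used in place of body.iterdescendants().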

Answer 2

It seems that your HTML is malformed. I wrote a small program with BeautifulSoup that you may be able to adapt for your purpose:

from bs4 import BeautifulSoup
html3 = "<html><head><title>test</title></head><body><h1>page title</h1><p>some text</p></body></html>"
soup = BeautifulSoup(html3, "html5lib")
body = soup.find('body')

for item in body.findChildren():
    print(item)

Output

<h1>page title</h1>
<p>some text</p>
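Note that findChildren() is an older alias of find_all() and, by default, returns every descendant rather than only the direct children. The sketch below illustrates the difference. It is an illustration under stated assumptions: the nested <div> is added purely for demonstration, the variable names are hypothetical, and the built-in "html.parser" is used instead of html5lib so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the <div> wrapper is added only to show nesting
html_doc = (
    "<html><head><title>test</title></head>"
    "<body><div><h1>page title</h1></div><p>some text</p></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")
body = soup.body

# recursive=False restricts the search to direct children of <body>
direct = [el.name for el in body.find_all(True, recursive=False)]

# The default (recursive=True) returns all descendant tags in document order
everything = [el.name for el in body.find_all(True)]

print(direct)      # ['div', 'p']
print(everything)  # ['div', 'h1', 'p']
```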

2 answers

solution1
2 ACCEPTED 2018-02-28 05:11:33

solution2
0 2018-02-28 05:13:30
