lxml.etree does not return all the text for an XPath query when parsing HTML

Question

When I parse HTML with lxml.etree and the target tag contains several child tags, XPath does not return all of its text. For example:

content = """
    <h3 id="author">
        <span>
            <a target="_blank">zhang</a>
        </span>
        <span>
            <a target="_blank">wang</a>
        </span>
        <p class="email">1234567@qq.com</p>
        <span>
            <a target="_blank">li</a>
        </span>
        <span>
            <a target="_blank">lin</a>
        </span>
    </h3>
"""

from lxml import etree
html_tree = etree.HTML(content)
print(html_tree.xpath('//h3[@id="author"]//text()'))

The result is:

['\n        ',
 '\n            ',
 'zhang',
 '\n        ',
 '\n        ',
 '\n            ',
 'wang',
 '\n        ',
 '\n        ']

The text "li" and "lin" is missing. But when I delete the <p> tag, I get all the text. For example:

content = """
    <h3 id="author">
        <span>
            <a target="_blank">zhang</a>
        </span>
        <span>
            <a target="_blank">wang</a>
        </span>
        <span>
            <a target="_blank">li</a>
        </span>
        <span>
            <a target="_blank">lin</a>
        </span>
    </h3>
"""

from lxml import etree
html_tree = etree.HTML(content)
print(html_tree.xpath('//h3[@id="author"]//text()'))

The result is:

['\n        ',
 '\n            ',
 'zhang',
 '\n        ',
 '\n        ',
 '\n            ',
 'wang',
 '\n        ',
 '\n        ',
 '\n            ',
 'li',
 '\n        ',
 '\n        ',
 '\n            ',
 'lin',
 '\n        ',
 '\n    ']

Environment: Python 3.6.2, lxml 3.8.0

Answer 1

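etree.HTML uses the HTML parser, which repairs markup it considers invalid. A <p> is not allowed inside an <h3> in HTML, so the parser most likely closes the <h3> as soon as it reaches the <p>, which leaves the later <span> elements outside it. That matches your output stopping after "wang". If you parse the string as XML with etree.fromstring instead, the nesting is kept as written and you get all the text (this only works when the markup is well-formed XML):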

from lxml import etree

content = """
    <h3 id="author">
        <span>
            <a target="_blank">zhang</a>
        </span>
        <span>
            <a target="_blank">wang</a>
        </span>
        <p class="email">1234567@qq.com</p>
        <span>
            <a target="_blank">li</a>
        </span>
        <span>
            <a target="_blank">lin</a>
        </span>
    </h3>
"""
root = etree.fromstring(content)
print(root.xpath('//h3[@id="author"]//text()'))

Result:

['\n        ', '\n            ', 'zhang', '\n        ', '\n        ', '\n            ', 'wang', '\n        ', '\n        ', '1234567@qq.com', '\n        ', '\n            ', 'li', '\n        ', '\n        ', '\n            ', 'lin', '\n        ', '\n    ']
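To check what the HTML parser actually did with the markup, it can help to serialize the repaired tree and compare both parsers side by side. The following is only a sketch: the shortened `content` string and the `stripped_text` helper are made up for illustration, and the exact shape of the repaired HTML tree may depend on the libxml2 version behind your lxml build.

```python
from lxml import etree

# Shortened version of the markup from the question (illustrative only).
content = """
<h3 id="author">
    <span><a target="_blank">zhang</a></span>
    <p class="email">1234567@qq.com</p>
    <span><a target="_blank">li</a></span>
</h3>
"""


def stripped_text(nodes):
    """Hypothetical helper: drop whitespace-only text nodes and trim the rest."""
    return [t.strip() for t in nodes if t.strip()]


# HTML parser: may restructure the tree to repair invalid nesting.
html_tree = etree.HTML(content)
# XML parser: keeps the nesting exactly as written (needs well-formed input).
xml_root = etree.fromstring(content)

# Print the repaired HTML tree to see where the <h3> really ends.
print(etree.tostring(html_tree, pretty_print=True).decode())

html_h3_text = stripped_text(html_tree.xpath('//h3[@id="author"]//text()'))
html_all_text = stripped_text(html_tree.xpath('//text()'))
xml_h3_text = stripped_text(xml_root.xpath('//h3[@id="author"]//text()'))

print("HTML parser, inside h3:", html_h3_text)
print("HTML parser, whole doc:", html_all_text)
print("XML parser, inside h3: ", xml_h3_text)
```

If the names missing from the h3 query still show up in the whole-document query, the text was not lost. It was moved outside the <h3> when the parser repaired the tree, so an XPath that is not restricted to the <h3> can still reach it.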
