lxml: How can I target and retrieve multiple element values?

Please consider the following HTML:

<html>
    <body>
        <ul>
            <li><h5>Title 1</h5><div><span>Apples</span></li>
            <li><h5>Title 2</h5><div><span>Bananas</span></li>
            <li><h5>Title 3</h5><div><span>Grapes</span></li>
            <li><h5>Title 4</h5><div><span>Pears</span></li>
        </ul>
    </body>
</html>

Using lxml, I can easily retrieve the h5 elements:

from lxml import html

example_html = '''<html>
    <body>
        <ul>
            <li><h5>Title 1</h5><div><span>Apples</span></li>
            <li><h5>Title 2</h5><div><span>Bananas</span></li>
            <li><h5>Title 3</h5><div><span>Grapes</span></li>
            <li><h5>Title 4</h5><div><span>Pears</span></li>
        </ul>
    </body>
</html>'''

tree = html.fromstring(example_html)

element_list = tree.xpath('//h5')

# List comprehension to get text
result = [i.text for i in element_list]

print(result)

That code, of course, produces:

['Title 1', 'Title 2', 'Title 3', 'Title 4']

But I need to know how to produce a result like this:

['Title 1', 'Apples', 'Title 2', 'Bananas', 'Title 3', 'Grapes', 'Title 4', 'Pears']

I tried modifying the code like this:

collector = []
for i in element_list:
    h5 = i.xpath('//h5')
    collector.append(h5[0].text)
    span = i.xpath('//span')
    collector.append(span[0].text)

print(collector)

But I got this result (close, but not quite):

['Title 1', 'Apples', 'Title 1', 'Apples', 'Title 1', 'Apples', 'Title 1', 'Apples']

Is this possible somehow? I got as far as the code above, and any help would be highly appreciated. Thank you kindly.

You can use a union, which returns results in document order.

e=tree.xpath("//li/h5|//li/div/span")
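
The union returns element objects, so the text still has to be pulled out of each one to get the flat list the question asks for. A minimal sketch, assuming the same HTML as in the question (the names `elements` and `result` are only illustrative):

```python
from lxml import html

example_html = '''<html>
    <body>
        <ul>
            <li><h5>Title 1</h5><div><span>Apples</span></li>
            <li><h5>Title 2</h5><div><span>Bananas</span></li>
            <li><h5>Title 3</h5><div><span>Grapes</span></li>
            <li><h5>Title 4</h5><div><span>Pears</span></li>
        </ul>
    </body>
</html>'''

tree = html.fromstring(example_html)

# The | operator unions both node sets; XPath returns the combined
# set in document order, so each h5 is followed by its span.
elements = tree.xpath('//li/h5 | //li/div/span')

# Extract the text of every matched element.
result = [el.text for el in elements]

print(result)
```

With the sample HTML this should print the alternating title/fruit list from the question.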

I'm not very familiar with lxml, but I have worked with Beautiful Soup. If you are okay with switching, try the following code:

from bs4 import BeautifulSoup

example_html = '''<html>
    <body>
        <ul>
            <li><h5>Title 1</h5><div><span>Apples</span></li>
            <li><h5>Title 2</h5><div><span>Bananas</span></li>
            <li><h5>Title 3</h5><div><span>Grapes</span></li>
            <li><h5>Title 4</h5><div><span>Pears</span></li>
        </ul>
    </body>
</html>'''

soup = BeautifulSoup(example_html, 'html.parser')
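# Use a name that does not shadow the built-in list;
# find_all is the current spelling of the older findAll.
collector = []
for elem in soup.find_all('li'):
    collector.append(elem.find('h5').text)
    collector.append(elem.find('span').text)

print(collector)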

Hope this helps!
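
The same per-`li` loop can also be written without leaving lxml. The attempt in the question repeated 'Title 1' and 'Apples' because an XPath starting with `//` searches from the document root no matter which element it is called on. A rough sketch, assuming the question's HTML, that loops over the `li` elements and uses paths relative to each one (the `./` prefix):

```python
from lxml import html

example_html = '''<html>
    <body>
        <ul>
            <li><h5>Title 1</h5><div><span>Apples</span></li>
            <li><h5>Title 2</h5><div><span>Bananas</span></li>
            <li><h5>Title 3</h5><div><span>Grapes</span></li>
            <li><h5>Title 4</h5><div><span>Pears</span></li>
        </ul>
    </body>
</html>'''

tree = html.fromstring(example_html)

collector = []
for li in tree.xpath('//li'):
    # './h5' is evaluated relative to this <li>, unlike '//h5',
    # which would start again from the document root.
    collector.append(li.xpath('./h5')[0].text)
    collector.append(li.xpath('./div/span')[0].text)

print(collector)
```

Indexing with `[0]` assumes every `li` contains both an `h5` and a `span`; real pages may need a check for empty results.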

Another solution, maybe you'll like it.

from simplified_scrapy import SimplifiedDoc
html = '''<html>
    <body>
        <ul>
            <li><h5>Title 1</h5><div><span>Apples</span></li>
            <li><h5>Title 2</h5><div><span>Bananas</span></li>
            <li><h5>Title 3</h5><div><span>Grapes</span></li>
            <li><h5>Title 4</h5><div><span>Pears</span></li>
        </ul>
    </body>
</html>'''
doc = SimplifiedDoc(html)
lis = doc.selects('li>(h5,span)')
print(lis)

Result:

[['Title 1', 'Apples'], ['Title 2', 'Bananas'], ['Title 3', 'Grapes'], ['Title 4', 'Pears']]
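
This result is a list of pairs, while the question asks for one flat list. A small sketch of one way to flatten it with the standard library, starting from the output shown above (the names `pairs` and `flat` are only illustrative):

```python
from itertools import chain

# Nested result as printed above: one [title, fruit] pair per <li>.
pairs = [['Title 1', 'Apples'], ['Title 2', 'Bananas'],
         ['Title 3', 'Grapes'], ['Title 4', 'Pears']]

# chain.from_iterable concatenates the inner lists in order.
flat = list(chain.from_iterable(pairs))

print(flat)
# ['Title 1', 'Apples', 'Title 2', 'Bananas', 'Title 3', 'Grapes', 'Title 4', 'Pears']
```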
