How do I access inline elements inside a loop in lxml?

Question

I am trying to scrape values from a website.

# get the raw HTML
fruitsWebsite = lxml.html.parse( "http://pagetoscrape.com/data.html" )

# get all divs with class fruit 
fruits = fruitsWebsite.xpath( '//div[@class="fruit"]' )

# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    print fruit.xpath('//li[@class="fruit"]/em')[0].text

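However, the Python interpreter complains that index 0 is out of range. That is odd, because I am sure the element exists. What is the correct way to access the inner <em> element with lxml?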

Answer 1

The following code works for me with my test file.

#test.py
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse('test.html')

# get all divs with class fruit 
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')

# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    # Use a relative path (note the leading .//) so we don't find ALL of the
    # li/em elements once per fruit div.
    for item in fruit.xpath('.//li[@class="fruit"]/em'):
        print(item.text)


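# Alternatively, select every matching <em> with a single query on the whole
# document (no outer loop needed, so query the parsed tree rather than `fruit`)
for item in fruitsWebsite.xpath('//div[@class="fruit"]//li[@class="fruit"]/em'):
    print(item.text)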

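This is the HTML file I tested against. If the code does not work for the HTML you are testing against, you will need to post a sample file that fails, as I requested in the comments above.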

<html>
<body>
Blah blah
<div>Ignore me</div>
<div>Outer stuff
    <div class='fruit'>Some <em>FRUITY</em> stuff.
    <ol>
        <li class='fruit'><em>This</em> should show</li>
        <li><em>Super</em> Ignored LI</li>
        <li class='fruit'><em>Rawr</em> Hear it roar.</li>
    </ol>
    </div>
</div>
<div class='fruit'><em>Super</em> fruity website of awesome</div>
</body>
</html>
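
To try the relative-path loop without saving test.html to disk, the file above can be inlined as a string. This is only a sketch: `TEST_HTML` and `names_per_div` are names introduced here for illustration, and `lxml.html.fromstring` is used in place of `lxml.html.parse`.

```python
import lxml.html

# Inline copy of the test file above, so no file on disk is needed.
TEST_HTML = """\
<html>
<body>
Blah blah
<div>Ignore me</div>
<div>Outer stuff
    <div class='fruit'>Some <em>FRUITY</em> stuff.
    <ol>
        <li class='fruit'><em>This</em> should show</li>
        <li><em>Super</em> Ignored LI</li>
        <li class='fruit'><em>Rawr</em> Hear it roar.</li>
    </ol>
    </div>
</div>
<div class='fruit'><em>Super</em> fruity website of awesome</div>
</body>
</html>
"""

# For a full document, fromstring() returns the root <html> element.
root = lxml.html.fromstring(TEST_HTML)

names_per_div = []
for fruit in root.xpath('//div[@class="fruit"]'):
    # The leading './/' restricts the search to descendants of this div.
    names = [em.text for em in fruit.xpath('.//li[@class="fruit"]/em')]
    names_per_div.append(names)
    print(names)
```

The first fruit div yields `['This', 'Rawr']`; the second yields an empty list, because its <em> is a direct child of the div rather than of an <li class="fruit">.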

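The code as originally posted will certainly return too many results: the XPath in the inner loop starts with //, so it searches the whole tree rather than the subtree of each "fruit" div. Unless your input is different from what I understand it to be, the error you describe does not make much sense.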
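
The difference between the two path forms can be checked on a small document. The sketch below uses a made-up snippet (`SNIPPET` and the helper `first_text` are hypothetical, not from the question) in which the second fruit div has no <li class="fruit">. `xpath()` returns a plain list here, so indexing `[0]` on an empty result raises `IndexError`. That may or may not be what happened on the asker's page; the helper just avoids the exception.

```python
import lxml.html

# Hypothetical input: the second fruit div contains no <li class="fruit">.
SNIPPET = """\
<html><body>
<div class="fruit"><ol><li class="fruit"><em>Apple</em></li></ol></div>
<div class="fruit"><em>Pear</em> has no list</div>
</body></html>
"""

root = lxml.html.fromstring(SNIPPET)
fruits = root.xpath('//div[@class="fruit"]')


def first_text(element, path, default=None):
    """Return the text of the first match for path, or default if none."""
    matches = element.xpath(path)
    return matches[0].text if matches else default


# Absolute path: evaluated from the document root, so every div
# returns the same first match.
absolute = [first_text(f, '//li[@class="fruit"]/em') for f in fruits]

# Relative path: only the current div's subtree is searched.
relative = [first_text(f, './/li[@class="fruit"]/em') for f in fruits]

print(absolute)  # ['Apple', 'Apple']
print(relative)  # ['Apple', None]
```

Returning a default instead of indexing directly keeps the loop running when one div has no match, at the cost of having to handle `None` downstream.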

Answer 1
2, accepted, 2012-02-12 03:29:50
