How do I access inline elements inside a loop in lxml?

Question

I am trying to screen scrape values from a website.

import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse("http://pagetoscrape.com/data.html")

# get all divs with class fruit
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')

# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    print(fruit.xpath('//li[@class="fruit"]/em')[0].text)

However, the Python interpreter complains that index 0 is out of bounds. That's interesting because I am sure that the element exists. What is the proper way to access the inner <em> element with lxml?

Answer 1

The following code works for me with my test file.

# test.py
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse('test.html')

# get all divs with class fruit
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')

# Print the name of each fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    # Use a relative path so we don't find ALL of the li/em elements several times. Note the .//
    for item in fruit.xpath('.//li[@class="fruit"]/em'):
        print(item.text)

# Alternatively, run a single absolute query against the whole document
for item in fruitsWebsite.xpath('//div[@class="fruit"]//li[@class="fruit"]/em'):
    print(item.text)

Here is the HTML file I used to test against. If this doesn't work for the HTML you're testing against, you'll need to post a sample file that fails, as I requested in the comments above.

<html>
<body>
Blah blah
<div>Ignore me</div>
<div>Outer stuff
    <div class='fruit'>Some <em>FRUITY</em> stuff.
    <ol>
        <li class='fruit'><em>This</em> should show</li>
        <li><em>Super</em> Ignored LI</li>
        <li class='fruit'><em>Rawr</em> Hear it roar.</li>
    </ol>
    </div>
</div>
<div class='fruit'><em>Super</em> fruity website of awesome</div>
</body>
</html>

You will definitely get too many results with the code you originally posted (the XPath query inside the loop searches the entire tree rather than the subtree of each "fruit"). The error you're describing doesn't make much sense unless your input is different from what I understood.
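
To make the difference between the two paths concrete, here is a minimal sketch. The SAMPLE markup and the helper names (names, first_name) are made up for this illustration and are not taken from the page being scraped. It runs the absolute path and the relative path against the same fruit divs, and also shows one way to avoid indexing into an empty result list, which is the usual cause of an out-of-bounds error on [0].

```python
import lxml.html

# Hypothetical sample markup, invented for this illustration only.
SAMPLE = """
<html><body>
<div class="fruit"><ol><li class="fruit"><em>Apple</em></li></ol></div>
<div class="fruit"><ol><li class="fruit"><em>Pear</em></li></ol></div>
<div class="fruit">No list here</div>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE)
fruits = root.xpath('//div[@class="fruit"]')


def names(path):
    """Return, for each fruit div, the text of every <em> matched by path."""
    return [[em.text for em in fruit.xpath(path)] for fruit in fruits]


# A path starting with // is evaluated from the document root,
# so every div reports every matching <em> in the document.
absolute = names('//li[@class="fruit"]/em')

# A path starting with .// is evaluated from the current element,
# so each div reports only the <em> elements in its own subtree.
relative = names('.//li[@class="fruit"]/em')

print(absolute)
print(relative)


def first_name(fruit):
    """Return the first matching <em> text, or None when nothing matches."""
    matches = fruit.xpath('.//li[@class="fruit"]/em')
    # xpath() returns a list; indexing an empty list raises IndexError.
    return matches[0].text if matches else None


print([first_name(fruit) for fruit in fruits])
```

With the relative path, a fruit div that contains no matching li yields an empty list, so checking the list before taking [0] is what keeps the loop from failing on such a div.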

Answer 1: 2, accepted, 2012-02-12 03:29:50
