简体   繁体   中英

How do I access an inline element inside a loop in lxml?

I am trying to screen scrape values from a website.

# get the raw HTML
fruitsWebsite = lxml.html.parse( "http://pagetoscrape.com/data.html" )

# get all divs with class fruit 
fruits = fruitsWebsite.xpath( '//div[@class="fruit"]' )

# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    print fruit.xpath('//li[@class="fruit"]/em')[0].text

However, the Python interpreter complains that 0 is an out of bounds iterator. That's interesting because I am sure that the element exists. What is the proper way to access the inside <em> element with lxml?

The following code works for me with my test file.

#test.py
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse('test.html')

# get all divs with class fruit 
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')

# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    #Use a relative path so we don't find ALL of the li/em elements several times. Note the .//
    for item in fruit.xpath('.//li[@class="fruit"]/em'):
        print(item.text)


#Alternatively
for item in fruit.xpath('//div[@class="fruit"]//li[@class="fruit"]/em'):
    print(item.text)

Here is the html file I used to test again. If this doesn't work for the html you're testing again, you'll need to post a sample file that fails as I requested in the comments above.

<html>
<body>
Blah blah
<div>Ignore me</div>
<div>Outer stuff
    <div class='fruit'>Some <em>FRUITY</em> stuff.
    <ol>
        <li class='fruit'><em>This</em> should show</li>
        <li><em>Super</em> Ignored LI</li>
        <li class='fruit'><em>Rawr</em> Hear it roar.</li>
    </ol>
    </div>
</div>
<div class='fruit'><em>Super</em> fruity website of awesome</div>
</body>
</html>

You definitely will get too many results with the code you originally posted (the inner loop will search the entire tree rather than the subtree for each "fruit"). The error you're describing doesn't make much sense unless your input is different than what I understood.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM