Extracting attributes from HTML with lxml

I use lxml to retrieve the attributes of tags from an HTML page. The HTML page is formatted like this:

<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>

The Python script I use to retrieve the URL in the <a> tag and the src value of the <img> tag inside the same <div> is this:

from lxml import html 

...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.xpath('/@href')
    src = element.xpath('//img/@src')

Why don't I get the strings?

You are using lxml, so you are working with lxml objects: HtmlElement instances. HtmlElement derives from the etree element class ( http://lxml.de/api/lxml.etree._Element-class.html ), which has a get method that returns an attribute's value. So the proper way for you is:

from lxml import html 

...
tree = html.fromstring(page.text)
for link_element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = link_element.get('href')
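    image_element = link_element.find('img')  # search from the <a> element, not from the href string
    if image_element is not None:  # an element with no children is falsy, so compare against None
        img_src = image_element.get('src')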
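As a minimal sketch of the get/find approach, the snippet below collects (href, src) pairs and shows what happens when an attribute or a child is missing. The inline markup and the names `snippet`, `results` and the second `<a name="anchor_only">` tag are assumptions made for illustration; they are not part of the original page.

```python
from lxml import html

# Hypothetical markup: the second <a> has neither an href nor an <img> child.
snippet = (
    '<html><body><div class="my_div">'
    '<a href="/foobar"><img src="my_img.png"></a>'
    '<a name="anchor_only">text</a>'
    '</div></body></html>'
)
tree = html.fromstring(snippet)

results = []
for link in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = link.get('href')   # get() returns None when the attribute is absent
    img = link.find('img')    # find() returns None when no matching child exists
    src = img.get('src') if img is not None else None
    results.append((href, src))

print(results)
```

Comparing against None, rather than relying on the element's truth value, matters here because an `<img>` element has no children.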

If you change your code to:

from lxml import html 

...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.items()[0][1]  # value of the first attribute, which is "href" in this markup
    src = element.xpath('.//img/@src')[0]  # the leading "." keeps the search inside this <a> element
    print(href, src)

You'll get what you need.
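One detail worth checking when calling xpath() on an element: an expression that starts with // searches the whole document, while one that starts with .// stays below the current element. The sketch below assumes a hypothetical page with a second, unrelated div (class "other") to make the difference visible.

```python
from lxml import html

# Hypothetical page with two links; only the second one is inside "my_div".
page = (
    '<html><body>'
    '<div class="other"><a href="/first"><img src="first.png"></a></div>'
    '<div class="my_div"><a href="/foobar"><img src="my_img.png"></a></div>'
    '</body></html>'
)
tree = html.fromstring(page)
link = tree.xpath('//div[contains(@class, "my_div")]//a')[0]

document_wide = link.xpath('//img/@src')  # absolute path: every <img> in the document
scoped = link.xpath('.//img/@src')        # relative path: only <img> below this <a>

print(document_wide)
print(scoped)
```

With a single image on the page both expressions return the same list, which is why the difference is easy to miss.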

The lxml documentation mentions some of these things, but I feel it is lacking in places, so you might want to use an interactive Python shell to study the properties of the instances returned by tree.xpath(). Or you could look into another parser entirely, such as BeautifulSoup, which has very good examples and documentation. Just sharing...
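For instance, a short exploratory session along these lines shows what the objects returned by tree.xpath() expose. This is only a sketch; the `title` attribute and the variable names are assumptions added for illustration.

```python
from lxml import html

# Hypothetical markup with a second attribute (title) added for illustration.
tree = html.fromstring(
    '<html><body><div class="my_div">'
    '<a href="/foobar" title="Foo"><img src="my_img.png"></a>'
    '</div></body></html>'
)
link = tree.xpath('//a')[0]

is_html_element = isinstance(link, html.HtmlElement)  # xpath() on tags yields elements
tag = link.tag                                        # the tag name as a string
attributes = dict(link.attrib)                        # attrib behaves like a dict
pairs = link.items()                                  # (name, value) tuples in document order
href_values = link.xpath('@href')                     # attribute paths yield a list of strings

print(is_html_element, tag, attributes, pairs, href_values)
```

Calling dir(link) or help(link) in the shell lists the remaining methods, including get() and find().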

The reason you didn't get the results you wanted is that you're trying to get attributes from the NEXT children rather than from the existing node.

See this:

from lxml import html

s = '''<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>'''

tree = html.fromstring(s)

# when you do path... //a, you are ALREADY at the 'a' node
for el in tree.xpath('//div[contains(@class, "my_div")]//a'):
    # you were trying to get next children /@href, which doesn't exist
    print(el.xpath('@href'))  # you should instead access the existing node's attribute
    print(el.xpath('img/@src'))  # same here, not /img/@src ...

['/foobar']
['my_img.png']
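To see the failing and the working expressions side by side, the sketch below evaluates both on the same `<a>` element. The variable names are illustrative. In XPath a path beginning with a single / is resolved from the document root, not from the element, and the root has no attributes, so that expression comes back empty.

```python
from lxml import html

tree = html.fromstring(
    '<html><body><div class="my_div">'
    '<a href="/foobar"><img src="my_img.png"></a>'
    '</div></body></html>'
)
el = tree.xpath('//div[contains(@class, "my_div")]//a')[0]

from_root = el.xpath('/@href')    # resolved from the document root: nothing matches
from_node = el.xpath('@href')     # attribute of the current <a> element
from_child = el.xpath('img/@src') # attribute of a direct <img> child

print(from_root, from_node, from_child)
```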

Hope this helps.
