Extracting attributes from HTML with lxml

Question

I use lxml to retrieve the attributes of tags from an HTML page. The page is formatted like this:

<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>

The Python script I use to retrieve the URL inside the <a> tag and the src value of the <img> tag inside the same <div> is this:

from lxml import html 

...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.xpath('/@href')
    src = element.xpath('//img/@src')

Why don't I get the strings?

Answer 1

You are using lxml, so you are working with lxml objects: HtmlElement instances. HtmlElement derives from etree.Element ( http://lxml.de/api/lxml.etree._Element-class.html ), which has a get method that returns an attribute's value. So the proper way for you is:

from lxml import html 

...
tree = html.fromstring(page.text)
for link_element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = link_element.get('href')
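    image_element = link_element.find('img')  # search the <a> element, not the href string
    if image_element is not None:  # a childless element is falsy, so compare with None
        img_src = image_element.get('src')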
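
For anyone who wants to try the get/find approach without fetching a page, here is a minimal self-contained sketch. The helper name extract_links is made up for illustration, and the markup is the sample from the question; it is not meant as the only way to do this.

```python
from lxml import html

# Sample markup copied from the question.
SAMPLE = '''<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>'''


def extract_links(markup):
    """Return (href, src) pairs for every <a> inside a my_div <div>.

    Hypothetical helper, written only to illustrate get() and find().
    """
    tree = html.fromstring(markup)
    pairs = []
    for link in tree.xpath('//div[contains(@class, "my_div")]//a'):
        href = link.get('href')    # attribute value, or None if absent
        image = link.find('img')   # first direct <img> child, or None
        # Compare with None: an element without children is falsy.
        src = image.get('src') if image is not None else None
        pairs.append((href, src))
    return pairs


print(extract_links(SAMPLE))  # [('/foobar', 'my_img.png')]
```

A link without an image yields None for src instead of raising, which is usually easier to handle downstream.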

Answer 2

If you change your code to:

from lxml import html 

...
tree = html.fromstring(page.text)
for element in tree.xpath('//div[contains(@class, "my_div")]//a'):
    href = element.items()[0][1]  # value of the first attribute, which is "href" here
    src = element.xpath('.//img/@src')[0]  # the leading "." keeps the search inside this <a>
    print(href, src)

You'll get what you need.

The lxml documentation covers some of this, but I feel it leaves a few things out, so you might want to use an interactive Python shell to study the properties of the instances returned by tree.xpath(). Or you could look into a different parser altogether, such as BeautifulSoup, which has very good examples and documentation. Just sharing...
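
As a rough idea of what such a shell session might look like, the sketch below inspects what xpath() returns for an element query versus an attribute query. The one-line markup is a compacted copy of the question's sample, and the variable names are arbitrary.

```python
from lxml import html

# Compacted version of the markup from the question.
tree = html.fromstring(
    '<div class="my_div"><a href="/foobar"><img src="my_img.png"></a></div>'
)

links = tree.xpath('//a')        # element query: a list of elements
hrefs = tree.xpath('//a/@href')  # attribute query: a list of strings

print(type(links).__name__, type(links[0]).__name__)
print(hrefs)                     # ['/foobar']
print(links[0].items())          # [('href', '/foobar')]
print(dict(links[0].attrib))     # {'href': '/foobar'}
```

Either way the result is a list, so a single value has to be taken out with an index such as [0].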

Answer 3

You didn't get the results you want because your paths start with a slash, so they look for the attributes somewhere other than the node you already have. The loop variable is already the <a> node, so its attributes should be read from that node directly.

See this:

from lxml import html

s = '''<div class="my_div">
    <a href="/foobar">
        <img src="my_img.png">
    </a>
</div>'''

tree = html.fromstring(s)

# after the path ...//a, you are ALREADY at the 'a' node
for el in tree.xpath('//div[contains(@class, "my_div")]//a'):
    # you were trying /@href, which does not address the current node
    print(el.xpath('@href'))  # read the attribute of the current node instead
    print(el.xpath('img/@src'))  # same here: img/@src, not /img/@src

['/foobar']
['my_img.png']
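
To make the difference visible, here is a small sketch that runs both kinds of path from the same <a> element. The extra div with banner.png is invented for this example so that the document-wide search has something else to find.

```python
from lxml import html

# The first div and banner.png are made up for illustration;
# the second div is the markup from the question.
PAGE = '''<html><body>
<div class="other"><img src="banner.png"></div>
<div class="my_div"><a href="/foobar"><img src="my_img.png"></a></div>
</body></html>'''

tree = html.fromstring(PAGE)
link = tree.xpath('//div[contains(@class, "my_div")]//a')[0]

# Paths starting with a slash are evaluated from the document root,
# even when xpath() is called on an element.
absolute_attr = link.xpath('/@href')      # the root has no attributes
absolute_imgs = link.xpath('//img/@src')  # every <img> in the document

# Paths without a leading slash start at the element itself.
relative_attr = link.xpath('@href')
relative_imgs = link.xpath('.//img/@src')

print(absolute_attr, absolute_imgs)
print(relative_attr, relative_imgs)
```

On a real page the document-wide form can quietly pick up images from unrelated parts of the markup, which is why the relative form is usually the safer choice inside a loop.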

Hope this helps.

Answer 1
5 2014-11-24 17:07:58

Answer 2
0 2014-11-21 21:38:33

Answer 3
0 2014-11-22 03:49:25
