Finding elements by attribute with lxml

Question

I need to parse an XML file to extract some data. I only need the elements that have certain attributes. Here's an example document:

<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>

Here I would like to get only the articles with the type "news". What's the most efficient and elegant way to do it with lxml?

I tried the find method, but the result is not very nice:

from lxml import etree
f = etree.parse("myfile")
root = f.getroot()
articles = root.getchildren()[0]
article_list = articles.findall('article')
for article in article_list:
    if "type" in article.keys():
        if article.attrib['type'] == 'news':
            content = article.find('content')
            content = content.text

Answer 1

You can use XPath, e.g. root.xpath("//article[@type='news']")

This XPath expression returns a list of all <article/> elements whose "type" attribute has the value "news". You can then iterate over the list to do what you want, or pass it wherever it is needed.

To get just the text content, you can extend the XPath like so:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

print(root.xpath("//article[@type='news']/content/text()"))

This outputs ['some text', 'some text']. If you want the content elements rather than their text, use "//article[@type='news']/content", and so on.
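As a minimal sketch of the element variant mentioned above, the snippet below selects the <content> elements and reads their text in Python. It also shows lxml's XPath variables, passed as keyword arguments to xpath(), which avoid building the expression by string formatting. The names contents, texts, wanted, info_texts and the variable $t are illustrative and not part of the original answer.

```python
from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

# Select the <content> elements themselves instead of their text nodes.
contents = root.xpath("//article[@type='news']/content")
texts = [c.text for c in contents]
print(texts)

# XPath variables ($t here, a name chosen for this example) are bound
# through keyword arguments, so the attribute value is never pasted
# into the expression string.
wanted = "info"
info_texts = root.xpath("//article[@type=$t]/content/text()", t=wanted)
print(info_texts)
```

Working with elements rather than text nodes is handy when you also need attributes or child elements of each match.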

Answer 2

Just for reference, you can achieve the same result with findall:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

articles = root.find("articles")
# findall accepts a limited XPath subset, including [@attr='value'] predicates
article_list = articles.findall("article[@type='news']/content")
for a in article_list:
    print(a.text)
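A possible shortening of the findall approach, sketched under the assumption that you do not need the intermediate <articles> element: starting the path with ".//" lets findall search all descendants of the root, so the separate find("articles") step can be dropped. The name news_texts is only for illustration.

```python
from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

# ".//" searches every descendant of root, so the predicate is applied
# to each <article> regardless of how deeply it is nested.
news_texts = [c.text for c in root.findall(".//article[@type='news']/content")]
print(news_texts)
```

Note that findall only supports a subset of XPath; for functions such as text() or contains(), xpath() is still needed.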

Answer 1: 71 votes, accepted, 2011-02-23 15:36:09
Answer 2: 9 votes, 2015-02-02 10:09:55
