Parsing XML with lxml - extracting element values

Question

Suppose we have an XML file with the following structure.

<?xml version="1.0" ?> 
<searchRetrieveResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/zing/srw/ http://www.loc.gov/standards/sru/sru1-1archive/xml-files/srw-types.xsd" xmlns="http://www.loc.gov/zing/srw/">
  <records xmlns:ns1="http://www.loc.gov/zing/srw/">
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
          <datafield tag="001">
            <subfield code="a">789</subfield>
            <subfield code="b">987</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
          <datafield tag="001">
            <subfield code="a">789</subfield>
            <subfield code="b">987</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>

I need to parse out:

The content of each "subfield" element (e.g. 123 in the example above), and
The attribute values (e.g. 000 or 001)

I wonder how to do this using lxml and XPath. Below is my initial code; could someone explain how to parse out these values?

import urllib, urllib2
from lxml import etree    

url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()

ns = {'xsi':'http://www.loc.gov/zing/srw/'}

for record in doc.xpath('//xsi:record', namespaces=ns):
    print record.xpath("xsi:recordData/record/datafield[@tag='000']", namespaces=ns)

Answer 1

I would be more direct in your XPath: go straight for the elements you want, in this case datafield.

for df in doc.xpath('//datafield'):
    # Iterate over the attributes of datafield
    for attrib_name in df.attrib:
        print '@' + attrib_name + '=' + df.attrib[attrib_name]

    # subfield elements are the children of datafield; iterating over
    # an element yields its children (getchildren() is deprecated)
    for subfield in df:
        print 'subfield=' + subfield.text

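Also, you can query datafield without a namespace prefix here because the inner record element declares xmlns="", which resets the default namespace, so it and its descendants are in no namespace. The outer elements (searchRetrieveResponse, records, record, recordData) are in the http://www.loc.gov/zing/srw/ namespace and would still need a prefix.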
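
To check how namespaces affect the queries here, the following is a minimal sketch, assuming Python 3 (so print is a function, unlike the snippets in this thread) and a shortened inline copy of the XML from the question instead of the remote file. The prefix name srw is an arbitrary choice for illustration; any prefix mapped to the same URI would behave the same way.

```python
from lxml import etree

# Shortened inline copy of the question's XML (an assumption for this sketch).
XML = b"""<?xml version="1.0" ?>
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records>
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>"""

root = etree.fromstring(XML)

# Arbitrary prefix bound to the document's default namespace URI.
ns = {"srw": "http://www.loc.gov/zing/srw/"}

# recordData is in the srw namespace, so an unprefixed query finds nothing.
outer_unprefixed = root.xpath("//recordData")
outer_prefixed = root.xpath("//srw:recordData", namespaces=ns)

# datafield sits below <record xmlns="">, so it is in no namespace.
inner = root.xpath("//datafield")

print(len(outer_unprefixed), len(outer_prefixed), len(inner))  # prints: 0 1 1
```

So a prefix is only needed when the path goes through the outer, namespaced elements.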

Answer 2

Try the following working code:

import urllib2
from lxml import etree

url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()

for record in doc.xpath('//datafield'):
    print record.xpath("./@tag")[0]
    for x in record.xpath("./subfield/text()"):
        print "\t", x
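
If you want to keep the values rather than print them, the same XPath expressions can feed a small data structure. This is only a sketch under some assumptions: Python 3, an inline copy of one inner record from the question, and a hypothetical helper name extract_fields that is not part of lxml.

```python
from lxml import etree

# Inline copy of one inner record from the question (an assumption for this sketch).
XML = b"""<record>
  <datafield tag="000">
    <subfield code="a">123</subfield>
    <subfield code="b">456</subfield>
  </datafield>
  <datafield tag="001">
    <subfield code="a">789</subfield>
    <subfield code="b">987</subfield>
  </datafield>
</record>"""


def extract_fields(root):
    """Return a list of (tag, {code: text}) pairs, one per datafield.

    extract_fields is a hypothetical helper written for this example.
    """
    result = []
    for df in root.xpath("//datafield"):
        tag = df.get("tag")  # attribute value, e.g. "000"
        # Map each subfield's code attribute to its text content.
        subfields = {sf.get("code"): sf.text for sf in df.xpath("./subfield")}
        result.append((tag, subfields))
    return result


fields = extract_fields(etree.fromstring(XML))
for tag, subfields in fields:
    print(tag, subfields)
```

A list of pairs is used instead of a dict keyed by tag because the same tag can appear in more than one record.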

Answer 3

I would just go with

for df in doc.xpath('//datafield'):
    print df.attrib
    for sf in df:
        print sf.text

Also, you don't need urllib; etree.parse can read the XML directly from an HTTP URL:

url = "http://dl.dropbox.com/u/540963/short_test.xml"  #doesn't work with https though
doc = etree.parse(url)
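
Since the comment above says the direct approach does not work with https, one possible workaround is to open https URLs yourself and hand the file object to etree.parse, which also accepts file-like objects. The sketch below assumes Python 3 (urllib.request instead of urllib2); parse_xml is a hypothetical helper name, and the demonstration uses an in-memory BytesIO so that it runs without network access.

```python
from io import BytesIO
from urllib.request import urlopen

from lxml import etree


def parse_xml(source):
    """Parse XML from a file name, URL, or file-like object.

    parse_xml is a hypothetical helper for this example. https URLs are
    fetched with urlopen first; everything else goes straight to etree.parse.
    """
    if isinstance(source, str) and source.startswith("https://"):
        with urlopen(source) as fp:
            return etree.parse(fp)
    return etree.parse(source)


# Demonstration with an in-memory document instead of a remote file.
doc = parse_xml(BytesIO(b"<record><datafield tag='000'/></record>"))
tags = doc.xpath("//datafield/@tag")
print(tags)
```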
