Parsing XPath in non-standard XML using lxml in Python

Question

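I am trying to build a database of all the patent information in Google Patents. So far, most of my work has relied on this very good answer by MattH on parsing non-standard XML files in Python. My Python script is too large to show here, so it is linked instead.

The source files are here: a bunch of XML files appended together into a single file with multiple headers. The problem is finding the right XPath expression when parsing these unusual "non-standard" XML files, which contain multiple XML and DTD declarations. I have been trying to use "-".join(doc.xpath to tie everything together as it is parsed, but for <document-id> and <classification-national>, shown below, the output contains blank spaces separated by hyphens.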

<references-cited> <citation> 
<patcit num="00001"> <document-id>
<country>US</country> 
<doc-number>534632</doc-number> 
<kind>A</kind>
<name>Coleman</name> 
<date>18950200</date> 
</document-id> </patcit>
<category>cited by examiner</category>
<classification-national><country>US</country>
<main-classification>249127</main-classification></classification-national>
</citation>

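Note that not every <citation> contains all of these children; sometimes they are absent entirely.

How can I write this XPath so that a hyphen is placed between each data entry, given that there are multiple entries under <citation>?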
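For readers who have not seen the linked answer, the "several documents in one file" part can be handled by cutting the text at each XML declaration and parsing every piece on its own. The following is only a rough sketch of that idea, not MattH's code: the helper names `split_documents` and `parse_documents`, the `<patent>` root, and the `patent.dtd` name are all invented for illustration, and the stdlib fallback import is an assumption so the snippet runs without lxml installed.

```python
import re

try:
    from lxml import etree  # the library used in this thread
except ImportError:  # assumption: fall back to the stdlib parser if lxml is absent
    import xml.etree.ElementTree as etree

# Made-up stand-in for a concatenated dump: two complete documents, each
# with its own XML declaration and DOCTYPE, stored back to back.
CONCATENATED = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE patent SYSTEM "patent.dtd">\n'
    '<patent><doc-number>534632</doc-number></patent>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE patent SYSTEM "patent.dtd">\n'
    '<patent><doc-number>D28957</doc-number></patent>\n'
)


def split_documents(text):
    """Cut text at every XML declaration; return one string per document."""
    starts = [m.start() for m in re.finditer(r"<\?xml\s", text)]
    bounds = starts + [len(text)]
    # Anything before the first declaration is ignored.
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def parse_documents(text):
    """Parse each piece separately and return the list of root elements."""
    # Encode to bytes: lxml rejects str input that carries an encoding declaration.
    return [etree.fromstring(piece.encode("utf-8"))
            for piece in split_documents(text)]


for root in parse_documents(CONCATENATED):
    print(root.tag, root.findtext("doc-number"))
```

Each returned root is an ordinary element tree, so the XPath question above can then be asked of one document at a time.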

Answer 1

With this XML (references.xml),

<references-cited> 
  <citation> 
    <patcit num="00001"> 
      <document-id>
        <country>US</country> 
        <doc-number>534632</doc-number> 
        <kind>A</kind>
        <name>Coleman</name> 
        <date>18950200</date> 
      </document-id> 
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>

  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D28957</doc-number>
        <kind>S</kind>
        <name>Simon</name>
        <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>

You can get the text content of every descendant of <citation> that has any with the following:

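from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')

for c in cits:
    # Every element below this <citation>, in document order.
    descs = c.xpath('.//*')
    for d in descs:
        # Skip container elements whose text is missing or only whitespace.
        if d.text and d.text.strip():
            print("%s: %s" % (d.tag, d.text))
    print()  # blank line between citations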

Output:

country: US
doc-number: 534632
kind: A
name: Coleman
date: 18950200
category: cited by examiner
country: US
main-classification: 249127

country: US
doc-number: D28957
kind: S
name: Simon
date: 18980600
category: cited by other

This variation:

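import sys
from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')

for c in cits:
    # Every element below this <citation>, in document order.
    descs = c.xpath('.//*')
    for d in descs:
        # Skip container elements whose text is missing or only whitespace.
        if d.text and d.text.strip():
            # Write each value with a leading hyphen and no newline.
            sys.stdout.write("-%s" % d.text)
    print()  # end the line for this citation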

results in:

-US-534632-A-Coleman-18950200-cited by examiner-US-249127
-US-D28957-S-Simon-18980600-cited by other
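If the leading hyphen is unwanted, one option is to collect the non-blank values first and join them afterwards. This is a minimal sketch along the lines of the code above, not a tested drop-in for the full patent files: the helper name `citation_lines` is made up, the XML is a compacted copy of references.xml, and the stdlib fallback import is an assumption so it runs without lxml. It uses `findall` and `iter`, which both libraries provide, in place of `xpath()`.

```python
try:
    from lxml import etree  # the library used in the answer
except ImportError:  # assumption: fall back to the stdlib parser if lxml is absent
    import xml.etree.ElementTree as etree

# Compacted copy of references.xml from the answer above.
SAMPLE = """<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country> <doc-number>534632</doc-number> <kind>A</kind>
        <name>Coleman</name> <date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>
  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country> <doc-number>D28957</doc-number> <kind>S</kind>
        <name>Simon</name> <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>"""


def citation_lines(root):
    """Return one hyphen-joined string per <citation> element."""
    lines = []
    for citation in root.findall("citation"):
        # Keep only elements whose own text is non-blank; containers such
        # as <document-id> hold nothing but whitespace and are skipped.
        # The isinstance check ignores comments, whose tag is not a string in lxml.
        values = [el.text.strip() for el in citation.iter()
                  if isinstance(el.tag, str) and el.text and el.text.strip()]
        lines.append("-".join(values))
    return lines


for line in citation_lines(etree.fromstring(SAMPLE)):
    print(line)
```

Filtering before joining is what avoids the hyphen-separated blanks described in the question: whitespace-only containers never reach the join. Children that are missing from a given <citation> simply contribute nothing, so the second citation yields a shorter line.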

Solution 1
1 2012-02-27 19:37:30
