Parsing XPath within non-standard XML using lxml in Python
I'm trying to create a database of all patent information from Google Patents. Much of my work so far has been based on this very good answer from MattH in "Python to parse non-standard XML file". My Python script is too large to display, so it's linked here.
The source files are here: a bunch of XML files appended together into one file with multiple headers. The issue is finding the correct XPath expression when parsing this unusual "non-standard" XML file, which has multiple XML and DTD declarations. I have been trying to use "-".join(doc.xpath to tie everything together when it's parsed out, but the output contains blanks separated by hyphens for the <document-id> and <classification-national> elements shown below:
<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>534632</doc-number>
        <kind>A</kind>
        <name>Coleman</name>
        <date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>
Note that not all children exist within each <citation>; sometimes they are not present at all.
How can I parse this with XPath while placing hyphens between the data entries when there are multiple entries under <citation>?
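For context, a file made of several XML documents appended together cannot be handed to a single parse call, because each new XML declaration is a syntax error inside the previous document. The following is a minimal sketch of one way to split such a stream, not the code from the linked answer: it assumes every document starts with an `<?xml` declaration on its own line, and `iter_documents` and the two-document `sample` are hypothetical names and data used only for illustration.

```python
from lxml import etree


def iter_documents(lines):
    """Yield one parsed root element per XML document in a concatenated stream.

    Assumption: each document begins with an '<?xml' declaration on its own line.
    """
    chunk = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.lstrip().startswith("<?xml") and "".join(chunk).strip():
            yield etree.fromstring("".join(chunk).encode("utf-8"))
            chunk = []
        chunk.append(line)
    # Parse whatever is left after the last declaration.
    if "".join(chunk).strip():
        yield etree.fromstring("".join(chunk).encode("utf-8"))


# Hypothetical two-document input standing in for the real source file.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<references-cited><citation><category>cited by examiner</category></citation></references-cited>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<references-cited><citation><category>cited by other</category></citation></references-cited>\n'
)

roots = list(iter_documents(sample.splitlines(keepends=True)))
for root in roots:
    print(root.xpath('string(citation/category)'))
```

With a real file, the same generator could be fed an open file object, since iterating over it also yields lines.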
From this XML (references.xml),
<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>534632</doc-number>
        <kind>A</kind>
        <name>Coleman</name>
        <date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>
  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D28957</doc-number>
        <kind>S</kind>
        <name>Simon</name>
        <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>
you can get the text content of every descendant of <citation> that has any content as follows:
from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')
for c in cits:
    # every descendant element of this citation
    descs = c.xpath('.//*')
    for d in descs:
        # skip elements whose text is missing or only whitespace
        if d.text and d.text.strip():
            print("%s: %s" % (d.tag, d.text))
    print()
Output:
country: US
doc-number: 534632
kind: A
name: Coleman
date: 18950200
category: cited by examiner
country: US
main-classification: 249127
country: US
doc-number: D28957
kind: S
name: Simon
date: 18980600
category: cited by other
This variation:
import sys
from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')
for c in cits:
    descs = c.xpath('.//*')
    for d in descs:
        if d.text and d.text.strip():
            # prefix each value with a hyphen, all on one line
            sys.stdout.write("-%s" % d.text)
    # end the line for this citation
    print()
results in this output:
-US-534632-A-Coleman-18950200-cited by examiner-US-249127
-US-D28957-S-Simon-18980600-cited by other
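Each line above starts with a hyphen because the separator is written before every value. If that is unwanted, one possible alternative is to collect the non-blank text nodes of each citation and join them afterwards. This is only a sketch: the inline `xml` string is a copy of references.xml so the example is self-contained, `rows` is a name chosen for illustration, and `.//text()` also returns tail text, which in this data is whitespace only and is filtered out.

```python
from lxml import etree

# Inline copy of references.xml so the sketch runs on its own.
xml = """
<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country><doc-number>534632</doc-number><kind>A</kind>
        <name>Coleman</name><date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country><main-classification>249127</main-classification>
    </classification-national>
  </citation>
  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country><doc-number>D28957</doc-number><kind>S</kind>
        <name>Simon</name><date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>
"""

root = etree.fromstring(xml.strip())

rows = []
for citation in root.xpath('citation'):
    # Keep only text nodes with real content, so missing or empty
    # children never produce blank fields between hyphens.
    texts = [t.strip() for t in citation.xpath('.//text()') if t.strip()]
    rows.append("-".join(texts))

for row in rows:
    print(row)
```

Because absent children contribute no text nodes, a citation without <classification-national> simply yields a shorter line. The trade-off is that fields are no longer positional, so a fixed-column layout would need a per-field lookup instead.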