Python or PETL Parsing XML
I have been playing with PETL to see whether I can extract multiple XML files and combine them into one.
I have no control over the structure of the XML files. Here are the variations I am seeing that are giving me trouble.
XML File 1 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
    <Info>
        <Name>John Doe</Name>
        <Date>01/01/2021</Date>
    </Info>
    <App>
        <Description></Description>
        <Type>Two</Type>
        <Details>
            <DetailOne>1</DetailOne>
            <DetailTwo>2</DetailTwo>
        </Details>
        <Details>
            <DetailOne>10</DetailOne>
            <DetailTwo>11</DetailTwo>
        </Details>
    </App>
</Export>
XML File 2 Example:
<?xml version="1.0" encoding="utf-8"?>
<Export>
    <Info>
        <Name></Name>
        <Date>01/02/2021</Date>
    </Info>
    <App>
        <Description>Sample description here.</Description>
        <Type>One</Type>
        <Details>
            <DetailOne>1</DetailOne>
            <DetailTwo>2</DetailTwo>
            <DetailOne>3</DetailOne>
            <DetailTwo>4</DetailTwo>
        </Details>
        <Details>
            <DetailOne>10</DetailOne>
            <DetailTwo>11</DetailTwo>
        </Details>
    </App>
</Export>
My Python code just scans the subfolder xmlfiles and then tries to use PETL to parse from there. Given the structure of the documents, I am loading three tables so far:
1. holds the Info name and date
2. holds the App description and type
3. collects the details
import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)
I cross-join the three tables because I want the Info and App data on each line with each detail. This works until I get an XML file that has multiple DetailOne and DetailTwo elements inside a single Details element.
What I am getting as results is:
Results:
+------------+----------+-------------+------+-----------+-----------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None | Two | 1 | 2 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None | Two | 10 | 11 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
Results:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | ('1', '3') | ('2', '4') | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
The second file showing DetailOne as ('1', '3') and DetailTwo as ('2', '4') is not what I want.
What I want is:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | 1 | 2 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 3 | 4 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
I believe XPath may be the way to go, but I am still stuck after researching:
https://petl.readthedocs.io/en/stable/io.html#xml-files - doesn't go in depth on lxml and petl
some light reading here: https://www.w3schools.com/xml/xpath_syntax.asp
some more reading here: https://lxml.de/tutorial.html
Any assistance on this is appreciated!
First, thanks for taking the time to write a good question. I'm happy to spend the time answering it.
I've never used PETL, but I did scan the docs for XML processing. I think your main problem is that the <Details> tag sometimes contains one pair of tags and sometimes multiple pairs. If only there was a way to extract a flat list of the <DetailOne> and <DetailTwo> tag values, without the enclosing tags getting in the way...
Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html and the XPath expression //Details/DetailOne returns the list 1,3,10 when applied to your example XML.
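The same check can be reproduced without an online tester. The following is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than PETL; the second example file from the question is inlined as a string (XML declaration omitted), and the variable names are only illustrative. Note that ElementTree expects the relative form .//Details/DetailOne when searching from an element.

```python
import xml.etree.ElementTree as ET

# Second example file from the question, inlined for illustration
# (XML declaration omitted so the text can be parsed from a plain string).
XML_TWO = """<Export>
    <Info>
        <Name></Name>
        <Date>01/02/2021</Date>
    </Info>
    <App>
        <Description>Sample description here.</Description>
        <Type>One</Type>
        <Details>
            <DetailOne>1</DetailOne>
            <DetailTwo>2</DetailTwo>
            <DetailOne>3</DetailOne>
            <DetailTwo>4</DetailTwo>
        </Details>
        <Details>
            <DetailOne>10</DetailOne>
            <DetailTwo>11</DetailTwo>
        </Details>
    </App>
</Export>"""

root = ET.fromstring(XML_TWO)

# './/' means "any descendant of the current element"; results come back
# in document order, so the enclosing <Details> grouping is flattened away.
detail_one = [el.text for el in root.findall('.//Details/DetailOne')]
detail_two = [el.text for el in root.findall('.//Details/DetailTwo')]

print(detail_one)  # ['1', '3', '10']
print(detail_two)  # ['2', '4', '11']
```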
So I suspect that something like this should work:
import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), '/App', {
            'DetailOne': '//DetailOne',
            'DetailTwo': '//DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)
The leading // may be redundant. It is XPath syntax for 'at any level in the document'. I don't know how PETL processes the XPath, so I'm trying to play safe. I agree, by the way: the documentation is rather light on details.
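If the XPath mapping passed to fromxml still yields tuples, one fallback is to build the rows by hand and pass them to PETL afterwards. The sketch below is an illustration under stated assumptions, not PETL's own method: it uses the standard-library xml.etree.ElementTree, a hypothetical helper named detail_rows, and it assumes that DetailOne and DetailTwo always appear as matching pairs, in document order, inside each Details element.

```python
import xml.etree.ElementTree as ET

HEADER = ('Date', 'Name', 'Description', 'Type',
          'DetailOne', 'DetailTwo', 'FileName')

# Second example file from the question, inlined (XML declaration omitted).
XML_TWO = """<Export>
    <Info><Name></Name><Date>01/02/2021</Date></Info>
    <App>
        <Description>Sample description here.</Description>
        <Type>One</Type>
        <Details>
            <DetailOne>1</DetailOne><DetailTwo>2</DetailTwo>
            <DetailOne>3</DetailOne><DetailTwo>4</DetailTwo>
        </Details>
        <Details>
            <DetailOne>10</DetailOne><DetailTwo>11</DetailTwo>
        </Details>
    </App>
</Export>"""


def detail_rows(xml_text, filename):
    """Hypothetical helper: one row per DetailOne/DetailTwo pair."""
    root = ET.fromstring(xml_text)
    # findtext returns '' for an empty element; map that to None so the
    # rows look like the tables in the question.
    date = root.findtext('Info/Date') or None
    name = root.findtext('Info/Name') or None
    description = root.findtext('App/Description') or None
    app_type = root.findtext('App/Type') or None

    rows = []
    for details in root.findall('App/Details'):
        ones = [el.text for el in details.findall('DetailOne')]
        twos = [el.text for el in details.findall('DetailTwo')]
        # Assumption: the n-th DetailOne belongs with the n-th DetailTwo.
        for one, two in zip(ones, twos):
            rows.append((date, name, description, app_type,
                         one, two, filename))
    return rows


rows = detail_rows(XML_TWO, 'two.xml')
for row in [HEADER] + rows:
    print(row)
```

Since a PETL table is essentially a sequence of rows with the header first, the list [HEADER] + rows should be usable with PETL functions directly (for example via etl.wrap), and the rows from several files can be concatenated before that step.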