
Python or PETL Parsing XML

I have been playing with PETL to see if I could extract data from multiple XML files and combine it into one table.

I have no control over the structure of the XML files. Here are the variations I am seeing, which are giving me trouble.

XML File 1 Example:

<?xml version="1.0" encoding="utf-8"?>
    <Export>
        <Info>
            <Name>John Doe</Name>
            <Date>01/01/2021</Date>
        </Info>
        <App>
            <Description></Description>
            <Type>Two</Type>
            <Details>
                <DetailOne>1</DetailOne>
                <DetailTwo>2</DetailTwo>
            </Details>
            <Details>
                <DetailOne>10</DetailOne>
                <DetailTwo>11</DetailTwo>
            </Details>
        </App>
    </Export>

XML File 2 Example:

<?xml version="1.0" encoding="utf-8"?>
    <Export>
        <Info>
            <Name></Name>
            <Date>01/02/2021</Date>
        </Info>
        <App>
            <Description>Sample description here.</Description>
            <Type>One</Type>
            <Details>
                <DetailOne>1</DetailOne>
                <DetailTwo>2</DetailTwo>
                <DetailOne>3</DetailOne>
                <DetailTwo>4</DetailTwo>
            </Details>
            <Details>
                <DetailOne>10</DetailOne>
                <DetailTwo>11</DetailTwo>
            </Details>
        </App>
    </Export>

My Python code just scans the xmlfiles subfolder and then tries to use PETL to parse from there. Given the structure of the documents, I am loading three tables so far:

1. Holds the Info name and date
2. Holds the App description and type
3. Collects the App details

import petl as etl
import os
from lxml import etree

# Folder holding the XML files, relative to the current working directory
xml_dir = os.path.join(os.getcwd(), 'xmlfiles')

for filename in os.listdir(xml_dir):
    if filename.endswith('.xml'):
        path = os.path.join(xml_dir, filename)

        # Get the Info children
        table1 = etl.fromxml(path, 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })

        # Get the App children
        table2 = etl.fromxml(path, 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })

        # Get the App Details children
        table3 = etl.fromxml(path, 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })

        # Combine: every Info/App row is paired with every Details row
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)

        print('Results:\n', result)

I cross join the three tables because I want the Info and App data on each line with each detail. This works until I get an XML file that has multiple DetailOne and DetailTwo elements inside a single Details block.

What I am getting as results is:

Results:

 +------------+----------+-------------+------+-----------+-----------+----------+
| Date       | Name     | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None        | Two  | 1         | 2         | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None        | Two  | 10        | 11        | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+

Results:

 +------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | ('1', '3') | ('2', '4') | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+

The second file shows DetailOne as ('1', '3') and DetailTwo as ('2', '4'), which is not what I want.

What I want is:

+------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | 1          | 2          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 3          | 4          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+

I believe XPath may be the way to go, but after researching I am still stuck:

https://petl.readthedocs.io/en/stable/io.html#xml-files - doesn't go into depth on lxml and petl

some light reading here: https://www.w3schools.com/xml/xpath_syntax.asp

some more reading here: https://lxml.de/tutorial.html

Any assistance on this is appreciated!

First, thanks for taking the time to write a good question. I'm happy to spend the time answering it.

I've never used PETL, but I did scan the docs for XML processing. I think your main problem is that the <Details> tag sometimes contains one pair of tags, and sometimes multiple pairs. If only there was a way to extract a flat list of the <DetailOne> and <DetailTwo> tag values, without the enclosing <Details> tags getting in the way...

Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html, and the XPath expression //Details/DetailOne returns the list 1,3,10 when applied to your second example XML.
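The same check can be reproduced locally. Here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than PETL (a substitution on my part, since I have not run PETL itself). The XML is the second example file from the question, with the XML declaration dropped so it can be parsed from a string; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Second example file from the question, inlined for the demo
XML_TWO = """
<Export>
    <Info>
        <Name></Name>
        <Date>01/02/2021</Date>
    </Info>
    <App>
        <Description>Sample description here.</Description>
        <Type>One</Type>
        <Details>
            <DetailOne>1</DetailOne>
            <DetailTwo>2</DetailTwo>
            <DetailOne>3</DetailOne>
            <DetailTwo>4</DetailTwo>
        </Details>
        <Details>
            <DetailOne>10</DetailOne>
            <DetailTwo>11</DetailTwo>
        </Details>
    </App>
</Export>
"""

root = ET.fromstring(XML_TWO)

# './/' searches at any depth below the root element;
# matches come back in document order
detail_one = [e.text for e in root.findall('.//Details/DetailOne')]
detail_two = [e.text for e in root.findall('.//Details/DetailTwo')]

print(detail_one)  # ['1', '3', '10']
print(detail_two)  # ['2', '4', '11']
```

Note the leading dot: ElementTree only accepts relative paths when searching from an element, so the tester's //Details/DetailOne becomes .//Details/DetailOne here.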

So I suspect that something like this should work:

import petl as etl
import os
from lxml import etree

# Folder holding the XML files, relative to the current working directory
xml_dir = os.path.join(os.getcwd(), 'xmlfiles')

for filename in os.listdir(xml_dir):
    if filename.endswith('.xml'):
        path = os.path.join(xml_dir, filename)

        # Get the Info children
        table1 = etl.fromxml(path, 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })

        # Get the App children
        table2 = etl.fromxml(path, 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })

        # Get the App Details children, searching at any depth
        table3 = etl.fromxml(path, '/App', {
            'DetailOne': '//DetailOne',
            'DetailTwo': '//DetailTwo'
        })

        # Combine: every Info/App row is paired with every Details row
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)

        print('Results:\n', result)

The leading // may be redundant. It is XPath syntax for 'at any level in the document'. I don't know how PETL processes the XPath, so I'm trying to play it safe. If PETL rejects the absolute form, the relative .//DetailOne is worth trying instead. I agree, by the way: the documentation is rather light on details.
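If fromxml still packs several matches into one cell, as the tuples in the question's output suggest it does, one fallback is to build the rows by hand and give PETL a plain list of tuples. The sketch below uses the standard-library xml.etree.ElementTree instead of PETL's XML reader. detail_rows, HEADER and SAMPLE are names invented for the illustration, and it assumes that the n-th DetailOne inside a Details block belongs with the n-th DetailTwo.

```python
import xml.etree.ElementTree as ET

HEADER = ('Date', 'Name', 'Description', 'Type',
          'DetailOne', 'DetailTwo', 'FileName')

def detail_rows(root, filename):
    """Yield one flat row per (DetailOne, DetailTwo) pair under <Export>."""
    info = root.find('Info')
    app = root.find('App')
    # findtext returns '' for an empty element; `or None` turns that
    # into None, matching the None shown in the question's tables.
    shared = (
        info.findtext('Date') or None,
        info.findtext('Name') or None,
        app.findtext('Description') or None,
        app.findtext('Type') or None,
    )
    for details in app.findall('Details'):
        ones = [e.text for e in details.findall('DetailOne')]
        twos = [e.text for e in details.findall('DetailTwo')]
        # zip pairs values by position and stops at the shorter list
        for one, two in zip(ones, twos):
            yield shared + (one, two, filename)

# Second example file from the question, without the XML declaration
SAMPLE = """
<Export>
    <Info>
        <Name></Name>
        <Date>01/02/2021</Date>
    </Info>
    <App>
        <Description>Sample description here.</Description>
        <Type>One</Type>
        <Details>
            <DetailOne>1</DetailOne>
            <DetailTwo>2</DetailTwo>
            <DetailOne>3</DetailOne>
            <DetailTwo>4</DetailTwo>
        </Details>
        <Details>
            <DetailOne>10</DetailOne>
            <DetailTwo>11</DetailTwo>
        </Details>
    </App>
</Export>
"""

rows = list(detail_rows(ET.fromstring(SAMPLE), 'two.xml'))
for row in rows:
    print(row)
```

Inside the loop over files, ET.parse(path).getroot() would supply root. PETL treats any list of rows whose first row is the header as a table, so [HEADER] + rows collected across all files should be usable with the rest of the PETL pipeline.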
