Python + XML documents

I'm fairly new to XML and Python. Below is a cut-down version of a large XML file that I'm trying to load into Python so that I can eventually write it to a SQL Server database.

<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
  <Manifest>
    <Version value="2-0-0" />
    <PublicationType value="Full" />
    <PublicationSource value="TEST123" />
    <PublicationDate value="2022-05-23" />
    <PublicationSeqNum value="1659" />
    <FileCreationDateTime value="2022-05-23T22:14:47" />
    <RecordCount value="287654" />
    <ContentDescription value="FullFile_20220523" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
      <PrimaryRole id="456" displayName="Free air for you" />
    </PrimaryRoleScope>
  </Manifest>
  <CodeSystems>
    <CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
      <concept id="RC2" code="2" displayName="World1" />
      <concept id="RC1" code="1" displayName="World2" />
    </CodeSystem>
    <CodeSystem name="OrganisationRole" oid="5.4.7.8">
      <concept id="B1ng0" code="179" displayName="BoomBastic" />
      <concept id="R2D2a" code="180" displayName="Fantastic" />
    </CodeSystem>
  </CodeSystems>
</MyOrgRefData:OrgRefData>

I've tried lxml, pandas.read_xml and xml.etree, but I can't work out how to get what I want.

Ideally I'd like to pull Manifest into a dataframe ready to send to SQL (DataFrame.to_sql()). I would do the same with CodeSystems, but separately. (There are other sections, but I cut them out to keep this short.)

For example, using pandas to read the file, I can only get a column containing the values. I would like either to have the tag name (Version, PublicationType, PublicationSource, etc.) in a column beside the value, or to have the tags as column headers with the values pivoted across a single row.

import pandas as pd

dataFolder = '/Some/directory/'  # trailing slash so the concatenation below forms a valid path
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    xpath='//Manifest/*',
    attrs_only=True,
)
df_bulk.head()

This is the output I get:

inx  value
0    2-0-0
1    Full
2    TEST123
3    2022-05-23
4    1659
5    2022-05-23T22:14:47
6    287654
7    FullFile_20220523

Ideally I would like:

inx                   value
Version               2-0-0
PublicationType       Full
PublicationSource     TEST123
PublicationDate       2022-05-23
PublicationSeqNum     1659
FileCreationDateTime  2022-05-23T22:14:47
RecordCount           287654
ContentDescription    FullFile_20220523

The eagle-eyed among you will notice I've left out PrimaryRoleScope. Ideally I would like to handle it separately in its own dataframe as well, but I'm unsure how to exclude it when pulling in the rest of the Manifest section.

Many thanks if you've read this far, and even more thanks for any help.

One possibility is to use the stylesheet parameter of pandas.read_xml, which transforms the XML data with XSLT before it is parsed.

So your code could look like this:

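import pandas as pd

dataFolder = '/Some/directory/'  # trailing slash so the concatenation below forms a valid path
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',  # the XSLT is applied first
    xpath='/Root/Item',           # so this XPath addresses the transformed document
    attrs_only=True,
)
print(df_bulk.head(10))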

The stylesheet (transform.xslt) passed to read_xml could be the following (lxml is required):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>    
       
    <xsl:template match="/">
        <Root><xsl:apply-templates /></Root>
    </xsl:template>

    <xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
        <Item name="{name()}" value="{@value}" />
    </xsl:template>
    
</xsl:stylesheet>

In this example, a new XML document like the following is created. It is an intermediate document that is never displayed, but the xpath= parameter above has to be set to match its structure.

<Root>
    <Item name="Version" value="2-0-0"/>
    <Item name="PublicationType" value="Full"/>
    <Item name="PublicationSource" value="TEST123"/>
    <Item name="PublicationDate" value="2022-05-23"/>
    <Item name="PublicationSeqNum" value="1659"/>
    <Item name="FileCreationDateTime" value="2022-05-23T22:14:47"/>
    <Item name="RecordCount" value="287654"/>
    <Item name="ContentDescription" value="FullFile_20220523"/>
</Root>

And the final output is:

                   name                value
0               Version                2-0-0
1       PublicationType                 Full
2     PublicationSource              TEST123
3       PublicationDate           2022-05-23
4     PublicationSeqNum                 1659
5  FileCreationDateTime  2022-05-23T22:14:47
6           RecordCount               287654
7    ContentDescription    FullFile_20220523
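The question also asks for the alternative layout, with tags as column headers and the values across a single row. A minimal sketch of that pivot, using only the standard library on a shortened copy of the intermediate XML (the names `INTERMEDIATE` and `to_wide_record` are hypothetical, introduced here for illustration):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the intermediate document produced by the stylesheet.
INTERMEDIATE = """\
<Root>
    <Item name="Version" value="2-0-0"/>
    <Item name="PublicationType" value="Full"/>
    <Item name="RecordCount" value="287654"/>
</Root>
"""


def to_wide_record(xml_text):
    """Fold the name/value Item elements into one dict keyed by tag name."""
    root = ET.fromstring(xml_text)
    return {item.get("name"): item.get("value") for item in root.iter("Item")}


record = to_wide_record(INTERMEDIATE)
print(record)
```

If the data is already in the `df_bulk` frame shown above, `df_bulk.set_index("name")["value"].to_frame().T` should give an equivalent one-row frame, assuming the columns are called name and value as in that output.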

The stylesheet above produces only attributes, but you could also create an element structure with the XSLT if you prefer. In that case, change the Manifest template to

<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
    <Item>
        <name><xsl:value-of select="name()" /></name>
        <value><xsl:value-of select="@value" /></value>
   </Item>
</xsl:template>

and your Python code to

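import pandas as pd

dataFolder = '/Some/directory/'  # trailing slash so the concatenation below forms a valid path
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',  # attrs_only is dropped because the data now lives in child elements
)
print(df_bulk.head(10))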

The output is the same.
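If XSLT is not an option, the same split can be sketched with the standard library's xml.etree.ElementTree. This is not the stylesheet approach above but a swapped-in alternative. It assumes, as in the sample, that only the root element carries a namespace prefix, so Manifest and CodeSystems can be addressed by their plain names. `SAMPLE` and `split_sections` are hypothetical names, and the sample is a shortened copy of the question's file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's document (assumption: the real file has the same shape).
SAMPLE = """\
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0">
  <Manifest>
    <Version value="2-0-0" />
    <RecordCount value="287654" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
      <PrimaryRole id="456" displayName="Free air for you" />
    </PrimaryRoleScope>
  </Manifest>
  <CodeSystems>
    <CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
      <concept id="RC2" code="2" displayName="World1" />
    </CodeSystem>
    <CodeSystem name="OrganisationRole" oid="5.4.7.8">
      <concept id="B1ng0" code="179" displayName="BoomBastic" />
    </CodeSystem>
  </CodeSystems>
</MyOrgRefData:OrgRefData>
"""


def split_sections(xml_text):
    """Return three lists of dicts: manifest rows, primary roles, concepts."""
    root = ET.fromstring(xml_text)
    manifest = root.find("Manifest")

    # One name/value row per Manifest child, skipping PrimaryRoleScope.
    manifest_rows = [
        {"name": child.tag, "value": child.get("value")}
        for child in manifest
        if child.tag != "PrimaryRoleScope"
    ]

    # PrimaryRoleScope handled separately: one row per PrimaryRole.
    role_rows = [
        dict(role.attrib)
        for role in manifest.iterfind("PrimaryRoleScope/PrimaryRole")
    ]

    # One row per concept, tagged with the name of its parent CodeSystem.
    concept_rows = [
        {"system": system.get("name"), **concept.attrib}
        for system in root.iterfind("CodeSystems/CodeSystem")
        for concept in system.iterfind("concept")
    ]
    return manifest_rows, role_rows, concept_rows


manifest_rows, role_rows, concept_rows = split_sections(SAMPLE)
```

Each list can be passed to `pd.DataFrame(...)` and then written with `DataFrame.to_sql`. For the real file, `ET.parse(path).getroot()` would replace `ET.fromstring`.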
