Python + XML documents
I'm a bit new to XML and Python. Below is a cut-down version of a large XML file I'm trying to bring into Python to eventually write into a SQL Server db.
<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
<Manifest>
<Version value="2-0-0" />
<PublicationType value="Full" />
<PublicationSource value="TEST123" />
<PublicationDate value="2022-05-23" />
<PublicationSeqNum value="1659" />
<FileCreationDateTime value="2022-05-23T22:14:47" />
<RecordCount value="287654" />
<ContentDescription value="FullFile_20220523" />
<PrimaryRoleScope>
<PrimaryRole id="123" displayName="Free beer for me" />
<PrimaryRole id="456" displayName="Free air for you" />
</PrimaryRoleScope>
</Manifest>
<CodeSystems>
<CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
<concept id="RC2" code="2" displayName="World1" />
<concept id="RC1" code="1" displayName="World2" />
</CodeSystem>
<CodeSystem name="OrganisationRole" oid="5.4.7.8">
<concept id="B1ng0" code="179" displayName="BoomBastic" />
<concept id="R2D2a" code="180" displayName="Fantastic" />
</CodeSystem>
</CodeSystems>
</MyOrgRefData:OrgRefData>
I've tried lxml, pandas.read_xml and xml.etree, and I can't work out how to get what I want.
Ideally I'd like to pull Manifest into a dataframe ready to send to SQL (DataFrame.to_sql()). I would do the same with CodeSystems as well, but separately. (There are other sections, but I cut them to keep this short.)
For example, using pandas to read the file in, I can only get a column containing the values. But I would like either to have the tag (Version, PublicationType, PublicationSource etc.) in a column beside the value, or to have the tags as the column headers with the values pivoted across the row instead.
import pandas as pd

# Trailing slash added so the folder and file name join into a valid path.
dataFolder = '/Some/directory/'
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    xpath='//Manifest/*',
    attrs_only=True,
)
df_bulk.head()
This is the output I get:
inx | value |
---|---|
0 | 2-0-0 |
1 | Full |
2 | TEST123 |
3 | 2022-05-23 |
4 | 1659 |
5 | 2022-05-23T22:14:47 |
6 | 287654 |
7 | FullFile_20220523 |
Ideally I would like:
inx | value |
---|---|
Version | 2-0-0 |
PublicationType | Full |
PublicationSource | TEST123 |
PublicationDate | 2022-05-23 |
PublicationSeqNum | 1659 |
FileCreationDateTime | 2022-05-23T22:14:47 |
RecordCount | 287654 |
ContentDescription | FullFile_20220523 |
The eagle-eyed among you will notice I've left out PrimaryRoleScope. Ideally I would like to treat it separately, in its own dataframe. But I am unsure how to exclude it when pulling in the rest of the Manifest section.
Many thanks if you've read this far, and even more thanks for any help.
One possibility is to use the stylesheet parameter of read_xml to transform the XML data internally with XSLT before processing it.
So your code could look like this:
import pandas as pd

# Trailing slash added so the folder and file name join into a valid path.
dataFolder = '/Some/directory/'
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',
    attrs_only=True,
)
print(df_bulk.head(10))
The stylesheet (transform.xslt) to be passed to read_xml could be the following (lxml is required):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes"/>
<xsl:template match="/">
<Root><xsl:apply-templates /></Root>
</xsl:template>
<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
<Item name="{name()}" value="{@value}" />
</xsl:template>
</xsl:stylesheet>
In this example, a new XML document like the following is created. It is intermediate XML and is never shown, but the xpath= parameter above has to be set to match it.
<Root>
<Item name="Version" value="2-0-0"/>
<Item name="PublicationType" value="Full"/>
<Item name="PublicationSource" value="TEST123"/>
<Item name="PublicationDate" value="2022-05-23"/>
<Item name="PublicationSeqNum" value="1659"/>
<Item name="FileCreationDateTime" value="2022-05-23T22:14:47"/>
<Item name="RecordCount" value="287654"/>
<Item name="ContentDescription" value="FullFile_20220523"/>
</Root>
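Since the intermediate XML is never displayed by read_xml, it can help to run the stylesheet by hand while developing it. The following is a minimal sketch using lxml directly, assuming a shortened copy of the sample document held in a string (the variable names xslt_text and xml_text are illustrative, not from the original):

```python
from lxml import etree

# The stylesheet from above, held in a string for this sketch.
xslt_text = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <Root><xsl:apply-templates /></Root>
  </xsl:template>
  <xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
    <Item name="{name()}" value="{@value}" />
  </xsl:template>
</xsl:stylesheet>"""

# A shortened stand-in for the real file.
xml_text = """<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0">
  <Manifest>
    <Version value="2-0-0" />
    <PublicationType value="Full" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
    </PrimaryRoleScope>
  </Manifest>
</MyOrgRefData:OrgRefData>"""

# Compile the stylesheet and apply it to the parsed document.
transform = etree.XSLT(etree.fromstring(xslt_text))
result = transform(etree.fromstring(xml_text))

# Collect the generated Item elements; PrimaryRoleScope should not appear.
items = [(item.get("name"), item.get("value")) for item in result.getroot().iter("Item")]
print(items)
```

Printing str(result) instead shows the raw intermediate XML, which makes it easier to check that the xpath= value matches its structure.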
And the final output of read_xml is
name value
0 Version 2-0-0
1 PublicationType Full
2 PublicationSource TEST123
3 PublicationDate 2022-05-23
4 PublicationSeqNum 1659
5 FileCreationDateTime 2022-05-23T22:14:47
6 RecordCount 287654
7 ContentDescription FullFile_20220523
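The question also asks about having the tags as column headers with the values across a single row. A minimal sketch of that pivot, assuming a frame with the name and value columns shown above (the small df_bulk here is a hand-built stand-in rather than the result of reading the file, and df_wide is an illustrative name):

```python
import pandas as pd

# Stand-in for the frame returned by read_xml (a few values copied from the output).
df_bulk = pd.DataFrame({
    "name": ["Version", "PublicationType", "PublicationSource", "PublicationDate"],
    "value": ["2-0-0", "Full", "TEST123", "2022-05-23"],
})

# Use the name column as the index, keep only the values, then transpose
# so that each tag becomes a column header in a one-row frame.
df_wide = df_bulk.set_index("name")["value"].to_frame().T.reset_index(drop=True)
df_wide.columns.name = None  # drop the leftover "name" label on the column axis

print(df_wide)
```

Either shape can then be written out with DataFrame.to_sql(); the long form avoids schema changes if new tags appear in the Manifest, while the wide form gives one row per publication.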
The stylesheet above produces only attributes, but you could also create an element structure with the XSLT if you prefer. In that case, change the second template to
<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
<Item>
<name><xsl:value-of select="name()" /></name>
<value><xsl:value-of select="@value" /></value>
</Item>
</xsl:template>
and your Python code to
import pandas as pd

# Trailing slash added so the folder and file name join into a valid path.
dataFolder = '/Some/directory/'
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',
)
print(df_bulk.head(10))
The output is the same.
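If XSLT is more machinery than you want, a different route (not the stylesheet technique above) is to walk the tree with the standard library's xml.etree.ElementTree and build the rows yourself. This is a rough sketch under the assumption that the structure matches the sample in the question; it works on a shortened copy of the document held in a string, and the names manifest_rows, role_rows and concept_rows are illustrative:

```python
import xml.etree.ElementTree as ET

# A shortened stand-in for the real file.
xml_text = """<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0">
  <Manifest>
    <Version value="2-0-0" />
    <PublicationType value="Full" />
    <RecordCount value="287654" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
      <PrimaryRole id="456" displayName="Free air for you" />
    </PrimaryRoleScope>
  </Manifest>
  <CodeSystems>
    <CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
      <concept id="RC2" code="2" displayName="World1" />
      <concept id="RC1" code="1" displayName="World2" />
    </CodeSystem>
    <CodeSystem name="OrganisationRole" oid="5.4.7.8">
      <concept id="B1ng0" code="179" displayName="BoomBastic" />
    </CodeSystem>
  </CodeSystems>
</MyOrgRefData:OrgRefData>"""

root = ET.fromstring(xml_text)
# Only the root element is namespaced, so the children are found by plain tag name.
manifest = root.find("Manifest")

# One row per direct child of Manifest, skipping PrimaryRoleScope.
manifest_rows = [
    {"name": child.tag, "value": child.get("value")}
    for child in manifest
    if child.tag != "PrimaryRoleScope"
]

# PrimaryRole entries handled separately, one row per role.
role_rows = [dict(role.attrib) for role in manifest.iter("PrimaryRole")]

# One row per concept, carrying the parent CodeSystem's name and oid along.
concept_rows = [
    {"codeSystem": cs.get("name"), "oid": cs.get("oid"), **concept.attrib}
    for cs in root.iter("CodeSystem")
    for concept in cs.iter("concept")
]

print(manifest_rows)
print(role_rows)
print(concept_rows)
```

Each list of dictionaries can be passed to pd.DataFrame(...) to get one dataframe per section. For the real file, ET.parse(path).getroot() would replace ET.fromstring(xml_text).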