Python 提取子標簽不一致的 XML 數據

Question

我有一個 XML 文件，我需要從中提取數據並插入到數據庫表中。 我的掙扎是 XML 數據結構可能包含不一致的子標簽。 這意味着（在下面的示例中）一個父<Field>標簽可能包含也可能不包含<ListValue>標簽。

這是一個簡短的示例，我將添加可能包含另一個<ListValue>標記的附加<Field>標記。 注意：所有<Field>標記都應保持在<Record>標記下方的同一級別。

我想看看是否有人有比我下面的示例更“pythonic”的方式來轉換這些數據。 也許與列表理解？

我需要將多達 4,000,000 個<Record>級別的數據行插入到數據庫中，因此我不想浪費更多時間在 XML 中進行不必要的循環。 速度將是必不可少的。

任何幫助將不勝感激。

<?xml version="1.0" encoding="utf-16"?>
<Records count="10">
    <Metadata>
        <FieldDefinitions>
            <FieldDefinition id="15084" guid="f3426157-cbcb-4293-94e5-9f1c993db4b5" name="CCR_ID" alias="CCR_ID" />
            <FieldDefinition id="16335" guid="5dfddb49-9a7a-46ee-9bd2-d5bbed97a48d" name="Coming Due" alias="Coming_Due" />
        </FieldDefinitions>
    </Metadata>
    <LevelCounts>
        <LevelCount id="35" guid="661c747f-7ce5-474a-b320-044aaec7a5b1" count="10" />
    </LevelCounts>
    <Record contentId="20196771" levelId="35" levelGuid="661c747f-7ce5-474a-b320-044aaec7a5b1" moduleId="265" parentId="0">
        <Field id="15084" guid="f3426157-cbcb-4293-94e5-9f1c993db4b5" type="1">100383-320-V0217111</Field>
        <Field id="16335" guid="5dfddb49-9a7a-46ee-9bd2-d5bbed97a48d" type="4">
            <ListValues>
                <ListValue id="136572" displayName="121 - 180 days out">121 - 180 days out</ListValue>
            </ListValues>
        </Field>
    </Record>
    <Record contentId="20205193" levelId="35" levelGuid="661c747f-7ce5-474a-b320-044aaec7a5b1" moduleId="265" parentId="0">
        <Field id="15084" guid="f3426157-cbcb-4293-94e5-9f1c993db4b5" type="1">100383-320-V0217267</Field>
        <Field id="16335" guid="5dfddb49-9a7a-46ee-9bd2-d5bbed97a48d" type="4">
            <ListValues>
                <ListValue id="136572" displayName="121 - 180 days out">121 - 180 days out</ListValue>
            </ListValues>
        </Field>
    </Record>
    <Record contentId="20196779" levelId="35" levelGuid="661c747f-7ce5-474a-b320-044aaec7a5b1" moduleId="265" parentId="0">
        <Field id="15084" guid="f3426157-cbcb-4293-94e5-9f1c993db4b5" type="1">100384-320-V0217111</Field>
        <Field id="16335" guid="5dfddb49-9a7a-46ee-9bd2-d5bbed97a48d" type="4">
            <ListValues>
                <ListValue id="136572" displayName="121 - 180 days out">121 - 180 days out</ListValue>
            </ListValues>
        </Field>
    </Record>
</Records>

這是我解析數據的代碼：

from xml.etree import ElementTree
import pandas as pd

xml_string = '''SEE STRING ABOVE'''

auth_token = ElementTree.fromstring(xml_string.text)

dct = []
cols = ['CCR_ID', 'Coming_Due']

for r in auth_token.findall("Record"):
    for f in r.findall("Field"):

        if f.attrib['id'] == '15084':
            ccr_id = f.text

        for l in f.findall(".//ListValue"):
            coming_due = l.text

    dct.append((ccr_id, coming_due))


df = pd.DataFrame(dct)
df.columns = cols

print(df)

這是我的結果：

                CCR_ID Coming_Due
0  100383-320-V0217111    121 - 180 days out
1  100383-320-V0217267    121 - 180 days out
2  100384-320-V0217111    121 - 180 days out
3  100384-320-V0217267    121 - 180 days out
4  100681-320-V0217111    121 - 180 days out
5  100681-320-V0217267      11 - 30 days out
6  100684-320-V0217111    121 - 180 days out
7  100684-320-V0217267      11 - 30 days out
8  100685-320-V0217111    121 - 180 days out
9  100685-320-V0217267      11 - 30 days out

Answer 1

如果我理解正確，使用 pandas read_xml()可能會有所幫助：

df = pd.read_xml(string,"//Record//*")
df2= df[['Field','displayName']].copy()
df2['displayName'] = df2['displayName'].shift(-3)
df2.set_axis(['CCR_ID', 'Coming_Due'], axis=1,inplace=True)
df2.dropna()

Output 基於您的樣品 xml：

    Field   displayName
0   100383-320-V0217111     121 - 180 days out
4   100383-320-V0217267     121 - 180 days out
8   100384-320-V0217111     121 - 180 days out

Python 提取子標簽不一致的 XML 數據

問題描述

1 個解決方案

解決方案1
0 2021-12-16 00:37:23

Python 提取子標簽不一致的 XML 數據

問題描述

1 個解決方案

解決方案1 0 2021-12-16 00:37:23

解決方案1
0 2021-12-16 00:37:23