Expanding an XML data column in a Pandas dataframe and saving it as new columns
I have a dataset of shape (11145, 14). One of the columns contains fairly complicated XML values. I am trying to expand this XML column and add its contents as new columns.
Here is one example of this XML (I changed the values for privacy reasons, but the structure is the same):
'
<?xml version="1.0" encoding="UTF-8"?>
<modulo
xmlns="http://www.sadasdasdasdasd.it/12312312312/Fasdasdasda"
xmlns:xsi="http://www.sss1231231233.org/200232321/XMLSchema-instance">
<nomeTxt dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="Nome" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdasdasdasdqw">
<![CDATA[*****]]>
</nomeTxt>
<adasdasdasdaq2qwdwasxasxas dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="asdasdasdqweqwe" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdasdasd2szszxc">
<![CDATA[*****]]>
</adasdasdasdaq2qwdwasxasxas>
<qweweqweqweqweqweqwe dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="sdsdsds" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasd">
<![CDATA[M]]>
</qweweqweqweqweqweqwe>
<qewtrweqrqwerqwrqweqw dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="qewtrweqrqwerqwrqweqw" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdas">
<![CDATA[213123123123]]>
</qewtrweqrqwerqwrqweqw>
<qewtrweqrqwerqwrqzxczxcasxcasxweqw dataFill="ew.fill(\'date\')" dataFillMode="auto" modelCodeMeaning="Data di nascita" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasfafassadasdasdasdas">
<![CDATA[1927-21-13]]>
</qewtrweqrqwerqwrqzxczxcasxcasxweqw>
<sadasdasdasdasdsa codeValue="0" codeScheme="asdasdasdasdasdasd" codeMeaning="No" codeSchemeVersion="01">
<![CDATA[No]]>
</rbg_allergiefarmacologiche>
<xczcxzcxzczxczxcz codeValue="0" codeScheme="asdasdasdasdasdasd" codeMeaning="No" codeSchemeVersion="01">
<![CDATA[No]]>
</xczcxzcxzczxczxcz>
<asdasfascasasxasx codeValue="0" codeScheme="asdasdasdas" codeMeaning="No" codeSchemeVersion="01">
<![CDATA[No]]>
</asdasfascasasxasx>
<asdasxcasxasxasxzxxz>
<![CDATA[false]]>
</asdasxcasxasxasxzxxz>
<asxasxasxsaxasx xsi:nil="true"></asxasxasxsaxasx>
<saxasx>
<![CDATA[false]]>
</saxasx>
<asdasxasxasxas xsi:nil="true"></asdasxasxasxas>
<asasdasdasdas>
<![CDATA[false]]>
</asasdasdasdas>
<asasdasdasdasasasasd xsi:nil="true"></asasdasdasdasasasasd>
<asasdasdasasd>
<![CDATA[false]]>
</asasdasdasasd>
<zcxzcxzc xsi:nil="true"></zcxzcxzc>
</modulo>'
I tried to go through the column with a for loop, convert each XML value to a dictionary, and then save the result as columns. The problem with this approach is that each row contains different XML elements, and the number of elements varies from row to row, so my solution does not work.
df["XML_column"]
0 <?xml version="1.0" encoding="UTF-8"?><modulo ...
1 <?xml version="1.0" encoding="UTF-8"?><modulo ...
2 <?xml version="1.0" encoding="UTF-8"?><modulo ...
3 <?xml version="1.0" encoding="UTF-8"?><modulo ...
4 <?xml version="1.0" encoding="UTF-8"?><modulo ...
...
11140 <?xml version="1.0" encoding="UTF-8"?><modulo ...
11141 <?xml version="1.0" encoding="UTF-8"?><modulo ...
11142 <?xml version="1.0" encoding="UTF-8"?><modulo ...
11143 <?xml version="1.0" encoding="UTF-8"?><modulo ...
11144 <?xml version="1.0" encoding="UTF-8"?><modulo ...
Welcome. Your XML seems to be a bit bumpy (for example, the opening <sadasdasdasdasdsa> tag is closed by </rbg_allergiefarmacologiche>). If I take a clean fragment, for instance this:
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<modulo
xmlns="http://www.sadasdasdasdasd.it/12312312312/Fasdasdasda"
xmlns:xsi="http://www.sss1231231233.org/200232321/XMLSchema-instance">
<nomeTxt dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="Nome" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdasdasdasdqw">
<![CDATA[*****]]>
</nomeTxt>
<adasdasdasdaq2qwdwasxasxas dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="asdasdasdqweqwe" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdasdasd2szszxc">
<![CDATA[*****]]>
</adasdasdasdaq2qwdwasxasxas>
<qweweqweqweqweqweqwe dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="sdsdsds" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasd">
<![CDATA[M]]>
</qweweqweqweqweqweqwe>
<qewtrweqrqwerqwrqweqw dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="qewtrweqrqwerqwrqweqw" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdas">
<![CDATA[213123123123]]>
</qewtrweqrqwerqwrqweqw>
<qewtrweqrqwerqwrqzxczxcasxcasxweqw dataFill="ew.fill(\'date\')" dataFillMode="auto" modelCodeMeaning="Data di nascita" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasfafassadasdasdasdas">
<![CDATA[1927-21-13]]>
</qewtrweqrqwerqwrqzxczxcasxcasxweqw>
</modulo>'''
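As a side note on the "bumpy" remark: one way to find out which rows are not well-formed is to let a parser try each value and report the error. The following is only a minimal sketch using the standard library; the helper name `check_well_formed` and the two tiny sample strings are made up for illustration and are not taken from the real data.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hypothetical samples: the second one closes <a> with </b>.
good = "<modulo><a>No</a></modulo>"
bad = "<modulo><a>No</b></modulo>"

print(check_well_formed(good))  # prints None
print(check_well_formed(bad))   # prints a "mismatched tag" message with the position
```

Applied to the whole column (for instance with `df["XML_column"].map(check_well_formed)`), this would presumably give `None` for clean rows and an error message for the rest.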
I can do the following (showing only the first six columns here):
import pandas as pd
pd.read_xml(xml, parser='etree')
| | dataFill | dataFillMode | modelCodeMeaning | modelCodeScheme | modelCodeSchemeVersion | modelCodeValue |
---|---|---|---|---|---|---|
| 0 | ew.fill() | auto | Nome | asdasdasdas | 1 | asdasdasdasdasdasdqw |
| 1 | ew.fill() | auto | asdasdasdqweqwe | asdasdasdas | 1 | asdasdasdasdasd2szszxc |
| 2 | ew.fill() | auto | sdsdsds | asdasdasdas | 1 | asdasdasd |
| 3 | ew.fill() | auto | qewtrweqrqwerqwrqweqw | asdasdasdas | 1 | asdasdasdas |
| 4 | ew.fill('date') | auto | Data di nascita | asdasdasdas | 1 | asdasfafassadasdasdasdas |
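Note that `pd.read_xml` returns one row per child element of a single document, while the question asks for one row per document with the elements spread over columns. A possible way to get there is to flatten each XML value into a dictionary and let pandas align the differing keys, since missing keys simply become NaN. The sketch below is an assumption-laden illustration, not a tested solution for the real data: the helper names `local_name` and `xml_to_record`, the `tag@attribute` column naming, and the small sample frame are all invented here.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag or attribute name."""
    return tag.rsplit("}", 1)[-1]


def xml_to_record(xml_text):
    """Flatten one XML document into {column: value}; {} if it cannot be parsed."""
    try:
        root = ET.fromstring(xml_text.strip())
    except (ET.ParseError, AttributeError):
        # ParseError: malformed XML; AttributeError: NaN or other non-string cell
        return {}
    record = {}
    for child in root:
        name = local_name(child.tag)
        # Element text (CDATA included), or None when the element is empty
        record[name] = (child.text or "").strip() or None
        for attr, value in child.attrib.items():
            record[f"{name}@{local_name(attr)}"] = value
    return record


# Hypothetical sample: rows with different elements, plus one malformed row.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "XML_column": [
        '<modulo xmlns="urn:example"><sex code="1"><![CDATA[M]]></sex>'
        '<flag>false</flag></modulo>',
        '<modulo xmlns="urn:example"><flag>true</flag><note/></modulo>',
        '<modulo><a>No</b></modulo>',
    ],
})

records = [xml_to_record(text) for text in df["XML_column"]]
expanded = pd.DataFrame(records, index=df.index)  # union of all keys as columns
result = df.drop(columns="XML_column").join(expanded)
print(result)
```

Rows whose XML cannot be parsed end up with NaN in every expanded column, so they can be inspected separately. Whether the attributes are worth keeping as columns depends on the data; dropping the inner `for attr` loop would keep only the element text.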