简体   繁体   中英

Expanding XML data column in Pandas dataframe and save it as new columns

I have (11145, 14) shape dataset. In one of the column, I have a really complicated XML values. I am trying to expand this XML column and add them as new columns. Here is one example of this XML: ( i changed the values for privacy reason but this is the structure)

'
<?xml version="1.0" encoding="UTF-8"?>
<modulo
    xmlns="http://www.sadasdasdasdasd.it/12312312312/Fasdasdasda"
    xmlns:xsi="http://www.sss1231231233.org/200232321/XMLSchema-instance">
    <nomeTxt dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="Nome" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdasdasdasdqw">
        <![CDATA[*****]]>
    </nomeTxt>
    <adasdasdasdaq2qwdwasxasxas dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="asdasdasdqweqwe" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdasdasd2szszxc">
        <![CDATA[*****]]>
    </adasdasdasdaq2qwdwasxasxas>
    <qweweqweqweqweqweqwe dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="sdsdsds" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasd">
        <![CDATA[M]]>
    </qweweqweqweqweqweqwe>
    <qewtrweqrqwerqwrqweqw dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="qewtrweqrqwerqwrqweqw" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdas">
        <![CDATA[213123123123]]>
    </qewtrweqrqwerqwrqweqw>
    <qewtrweqrqwerqwrqzxczxcasxcasxweqw dataFill="ew.fill(\'date\')" dataFillMode="auto" modelCodeMeaning="Data di nascita" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasfafassadasdasdasdas">
        <![CDATA[1927-21-13]]>
    </qewtrweqrqwerqwrqzxczxcasxcasxweqw>
    <sadasdasdasdasdsa codeValue="0" codeScheme="asdasdasdasdasdasd" codeMeaning="No" codeSchemeVersion="01">
        <![CDATA[No]]>
    </rbg_allergiefarmacologiche>
    <xczcxzcxzczxczxcz codeValue="0" codeScheme="asdasdasdasdasdasd" codeMeaning="No" codeSchemeVersion="01">
        <![CDATA[No]]>
    </xczcxzcxzczxczxcz>
    <asdasfascasasxasx codeValue="0" codeScheme="asdasdasdas" codeMeaning="No" codeSchemeVersion="01">
        <![CDATA[No]]>
    </asdasfascasasxasx>
    <asdasxcasxasxasxzxxz>
        <![CDATA[false]]>
    </asdasxcasxasxasxzxxz>
    <asxasxasxsaxasx xsi:nil="true"></asxasxasxsaxasx>
    <saxasx>
        <![CDATA[false]]>
    </saxasx>
    <asdasxasxasxas xsi:nil="true"></asdasxasxasxas>
    <asasdasdasdas>
        <![CDATA[false]]>
    </asasdasdasdas>
    <asasdasdasdasasasasd xsi:nil="true"></asasdasdasdasasasasd>
    <asasdasdasasd>
        <![CDATA[false]]>
    </asasdasdasasd>
    <zcxzcxzc xsi:nil="true"></zcxzcxzc>
</modulo>'

I tried to search each column with for loop and and then tried to convert it as dictionary and then save it as columns. The problem with this solution, in each row there are different <xml columns and number of them are different. So my solution is not working.

df["XML_column"]
0        <?xml version="1.0" encoding="UTF-8"?><modulo ...
1        <?xml version="1.0" encoding="UTF-8"?><modulo ...
2        <?xml version="1.0" encoding="UTF-8"?><modulo ...
3        <?xml version="1.0" encoding="UTF-8"?><modulo ...
4        <?xml version="1.0" encoding="UTF-8"?><modulo ...
                               ...                        
11140    <?xml version="1.0" encoding="UTF-8"?><modulo ...
11141    <?xml version="1.0" encoding="UTF-8"?><modulo ...
11142    <?xml version="1.0" encoding="UTF-8"?><modulo ...
11143    <?xml version="1.0" encoding="UTF-8"?><modulo ...
11144    <?xml version="1.0" encoding="UTF-8"?><modulo ...

Welcome. Your XML seems to be a bit bumpy. If I take a clean fragment, for instance this:

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<modulo
    xmlns="http://www.sadasdasdasdasd.it/12312312312/Fasdasdasda"
    xmlns:xsi="http://www.sss1231231233.org/200232321/XMLSchema-instance">
    <nomeTxt dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="Nome" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdasdasdasdqw">
        <![CDATA[*****]]>
    </nomeTxt>
    <adasdasdasdaq2qwdwasxasxas dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="asdasdasdqweqwe" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdasdasd2szszxc">
        <![CDATA[*****]]>
    </adasdasdasdaq2qwdwasxasxas>
    <qweweqweqweqweqweqwe dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="sdsdsds" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasd">
        <![CDATA[M]]>
    </qweweqweqweqweqweqwe>
    <qewtrweqrqwerqwrqweqw dataFill="ew.fill()" dataFillMode="auto" modelCodeMeaning="qewtrweqrqwerqwrqweqw" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasdasdas">
        <![CDATA[213123123123]]>
    </qewtrweqrqwerqwrqweqw>
    <qewtrweqrqwerqwrqzxczxcasxcasxweqw dataFill="ew.fill(\'date\')" dataFillMode="auto" modelCodeMeaning="Data di nascita" modelCodeScheme="asdasdasdas" modelCodeSchemeVersion="01" modelCodeValue="asdasfafassadasdasdasdas">
        <![CDATA[1927-21-13]]>
    </qewtrweqrqwerqwrqzxczxcasxcasxweqw>
</modulo>'''

I can do the following (showing only the first six columns here):

pd.read_xml(xml,parser='etree')
dataFill dataFillMode modelCodeMeaning modelCodeScheme modelCodeSchemeVersion modelCodeValue
0 ew.fill() auto Nome asdasdasdas 1 asdasdasdasdasdasdqw
1 ew.fill() auto asdasdasdqweqwe asdasdasdas 1 asdasdasdasdasd2szszxc
2 ew.fill() auto sdsdsds asdasdasdas 1 asdasdasd
3 ew.fill() auto qewtrweqrqwerqwrqweqw asdasdasdas 1 asdasdasdas
4 ew.fill('date') auto Data di nascita asdasdasdas 1 asdasfafassadasdasdasdas

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM