Convert pandas dataframe column with xml data to normalised columns?

I have a DataFrame in pandas, one of whose columns holds an XML string. What I want to do is create one column for each of the XML nodes, with column names in a normalised form. For example:

    id    xmlcolumn
    1     <main attr1='abc' attr2='xyz'><item><prop1>text1</prop1><prop2>text2</prop2></item></main>
    2     <main ........</main>

I want to convert this to a data frame like so:

    id    main.attr1    main.attr2    main.item.prop1    main.item.prop2
    1     abc           xyz           text1              text2
    2     .....

How would I do that, while still keeping the existing columns in the DataFrame?

The first step is to convert the XML string to a pandas Series (under the assumption that there will always be the same number of columns in the end). So you need a function like:

def convert_xml(raw):
    # some etree XML mangling goes here; the real function
    # must return a pd.Series (left as a stub in this answer)
    ...

This can be achieved e.g. using the etree package in Python. The returned Series must have an index in which each entry is the name of a new column, e.g. for your example:

pd.Series(['abc', 'xyz'], index=['main.attr1', 'main.attr2'])
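One possible way to fill in `convert_xml` is sketched below, using `xml.etree.ElementTree` from the standard library. The helper name `flatten` and the naming scheme (attributes become `path.attribute`, leaf text goes under the element's own dotted path) are assumptions taken from the example in the question, not something the answer prescribes. The sketch also assumes sibling tags are unique; repeated tags would overwrite each other.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def flatten(element, prefix=""):
    """Map dotted paths to attribute values and leaf text (hypothetical helper)."""
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    # Attributes become "<path>.<attribute name>".
    flat = {f"{path}.{name}": value for name, value in element.attrib.items()}
    children = list(element)
    if not children and element.text and element.text.strip():
        # A leaf node contributes its text under its own path.
        flat[path] = element.text.strip()
    for child in children:
        # Assumption: sibling tags are unique; repeated tags overwrite each other.
        flat.update(flatten(child, path))
    return flat


def convert_xml(raw):
    """Parse one XML string into a Series indexed by the new column names."""
    return pd.Series(flatten(ET.fromstring(raw)))


raw = ("<main attr1='abc' attr2='xyz'><item><prop1>text1</prop1>"
       "<prop2>text2</prop2></item></main>")
row = convert_xml(raw)
```

Here `row` is indexed by `main.attr1`, `main.attr2`, `main.item.prop1` and `main.item.prop2`, which matches the column names asked for in the question.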

Given this function, you can do the following with pandas (mocking away the XML mangling):

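import pandas as pd

frame = pd.DataFrame({'keep': [42], 'xml': ['<foo></foo>']})
# one row per XML string, one column per entry in the returned Series
temp = frame['xml'].apply(convert_xml)
# swap the raw XML column for the expanded columns
frame = frame.drop('xml', axis=1)
frame = pd.concat([frame, temp], axis=1)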
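To see the steps above run end to end, here is a small sketch with a deliberately simplified stand-in for `convert_xml` that reads only the root element's attributes (an assumption made to keep the example short). The second row lacks `attr2`, which illustrates what happens when the "same number of columns" assumption does not hold: the missing entry shows up as NaN.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def convert_xml(raw):
    # Stand-in converter (assumption): only the root element's attributes.
    root = ET.fromstring(raw)
    return pd.Series({f"{root.tag}.{k}": v for k, v in root.attrib.items()})


frame = pd.DataFrame({
    "id": [1, 2],
    "xmlcolumn": ["<main attr1='abc' attr2='xyz'/>", "<main attr1='def'/>"],
})

# apply() with a Series-returning function yields one column per index label.
temp = frame["xmlcolumn"].apply(convert_xml)
# Drop the raw XML column and attach the expanded columns side by side.
result = pd.concat([frame.drop("xmlcolumn", axis=1), temp], axis=1)
```

The existing `id` column is kept, and because `temp` shares the index of `frame`, `pd.concat` with `axis=1` lines the rows up correctly.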
