
Split a column into multiple ones based on a groupby with Pandas

I've just started using Python and I'm stuck with a problem related to a dataset I'm working with.

I have the following dataset:

    C1  C2  C3          C4      C5      C6
    99  069 99002068    3348117 3230802 T6
    99  069 99002063    4599974 178885  T4
    99  069 99002063    4599974 4606066 T4
    99  069 99002063    4599974 236346  T4
    99  069 99002063    4599974 310114  T4

I need to transpose column C5 into multiple columns, grouping by columns C1, C2, C3, C4 and C6.

The code I've written so far is the following:

    # load libraries
    import pandas as pd

    # import CSV
    data = pd.read_csv(
        "C:/Users/mcatuogno/Desktop/lista_collegamenti_onb.csv",
        sep=";",
        header=None,
        dtype=str,
        usecols=[0, 1, 2, 3, 4, 5],
        names=["C1", "C2", "C3", "C4", "C5", "C6"]
    )

    # sort values
    dataSort = data.sort_values(["C1", "C2", "C3", "C4"])

    # transpose column based on group by function
    dataTranspose = dataSort.groupby(["C1", "C2", "C3", "C4", "C6"])["C5"].apply(list)

With the code above, the result is:

    C1   C2  ...              C6      C5
    99  000  ...  09900000001100      [102995, 102997, 102996]
    99  000  ...  09900000001135      [103042]

I don't know how to split column C5 into multiple columns named CN_1, CN_2, ..., CN_x.

Which Python function can I use?

Thanks in advance!

You can create a helper Series that numbers the rows within each group with GroupBy.cumcount, add it to the index to form a MultiIndex, and reshape with Series.unstack:

    g = dataSort.groupby(["C1", "C2", "C3", "C4", "C6"])["C5"].cumcount()
    print(g)
    1    0
    2    1
    3    2
    4    3
    0    0
    dtype: int64

    df = (dataSort.set_index(["C1", "C2", "C3", "C4", "C6", g])['C5']
                  .unstack()
                  .add_prefix('Cn_')
                  .reset_index())
    print(df)
       C1  C2        C3       C4  C6       Cn_0       Cn_1      Cn_2      Cn_3
    0  99  69  99002063  4599974  T4   178885.0  4606066.0  236346.0  310114.0
    1  99  69  99002068  3348117  T6  3230802.0        NaN       NaN       NaN
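As a minimal, self-contained sketch of the cumcount/unstack approach, the snippet below rebuilds the five sample rows from the question in memory (instead of reading the CSV) and keeps them as strings, mirroring `dtype=str`. Adding 1 to the counter and using the prefix `CN_` are assumptions made here to get the `CN_1, ..., CN_x` names the question asks for; `keys`, `pos` and `wide` are illustrative names only.

```python
import pandas as pd

# The five sample rows from the question, kept as strings (as with dtype=str).
data = pd.DataFrame(
    {
        "C1": ["99"] * 5,
        "C2": ["069"] * 5,
        "C3": ["99002068", "99002063", "99002063", "99002063", "99002063"],
        "C4": ["3348117", "4599974", "4599974", "4599974", "4599974"],
        "C5": ["3230802", "178885", "4606066", "236346", "310114"],
        "C6": ["T6", "T4", "T4", "T4", "T4"],
    }
)
keys = ["C1", "C2", "C3", "C4", "C6"]

# Position of each row inside its group, shifted so numbering starts at 1
# (assumption: 1-based names, to match CN_1, CN_2, ... from the question).
pos = data.groupby(keys).cumcount() + 1

wide = (
    data.set_index(keys + [pos])["C5"]  # group keys + position as MultiIndex
    .unstack()                          # last index level becomes the columns
    .add_prefix("CN_")                  # 1, 2, ... -> CN_1, CN_2, ...
    .reset_index()                      # group keys back to regular columns
)
print(wide)
```

Groups with fewer values than the largest group get missing values in the trailing `CN_` columns.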

Your solution can also be adapted by passing the lists to the DataFrame constructor to create a new DataFrame:

    dataTranspose = dataSort.groupby(["C1", "C2", "C3", "C4", "C6"])["C5"].apply(list)

    df = (pd.DataFrame(dataTranspose.values.tolist(), index=dataTranspose.index)
            .add_prefix('Cn_')
            .reset_index())
    print(df)
       C1  C2        C3       C4  C6     Cn_0       Cn_1      Cn_2      Cn_3
    0  99  69  99002063  4599974  T4   178885  4606066.0  236346.0  310114.0
    1  99  69  99002068  3348117  T6  3230802        NaN       NaN       NaN
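A minimal sketch of the constructor approach on the question's five sample rows, built in memory as strings rather than read from the CSV. The 1-based `CN_` column names and the variable names `rows`, `lists` and `wide` are assumptions for illustration, chosen to match the naming the question asks for.

```python
import pandas as pd

# The five sample rows from the question, kept as strings (as with dtype=str).
rows = [
    ("99", "069", "99002068", "3348117", "3230802", "T6"),
    ("99", "069", "99002063", "4599974", "178885", "T4"),
    ("99", "069", "99002063", "4599974", "4606066", "T4"),
    ("99", "069", "99002063", "4599974", "236346", "T4"),
    ("99", "069", "99002063", "4599974", "310114", "T4"),
]
data = pd.DataFrame(rows, columns=["C1", "C2", "C3", "C4", "C5", "C6"])
keys = ["C1", "C2", "C3", "C4", "C6"]

# One list of C5 values per group.
lists = data.groupby(keys)["C5"].apply(list)

# Each list becomes one row; shorter lists are padded with missing values.
wide = pd.DataFrame(lists.tolist(), index=lists.index)

# Rename the positional columns 0, 1, 2, ... to CN_1, CN_2, CN_3, ...
wide.columns = [f"CN_{i + 1}" for i in wide.columns]
wide = wide.reset_index()
print(wide)
```

In the printed outputs above, the `.0` suffixes appear because those columns were numeric and a missing value forces a float dtype; with string values, as in this sketch, the original text is preserved.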
