Transpose first row into column and simplify repeating columns in a pandas dataframe

Despite spending half the day on Stack Overflow, I have not found a solution. Working in Python 3.9.0, I need to clean a dataframe. The first row should be transposed into a column, the second row needs to be made a header, and the repeating columns ('political_rights', 'civil_liberties', 'status') need to be simplified into only 3 columns. This can be done by making the values in the column "country" repeat for each year. Whenever I accomplish one thing, I mess up another, so any help/advice is deeply appreciated!

Simplified version of current dataframe (actual df: 207 rows × 148 columns):

import pandas as pd

df_bad = pd.DataFrame({'col1': ['years', 'country', 'Afghanistan', 'Albania', 'Algeria', 'Andorra'],
                       'col2': [1972, 'political_rights', 4, 7, 6, 4],
                       'col3': [1972, 'civil_liberties', 5, 7, 6, 3],
                       'col4': [1972, 'status', 'PF', 'NF', 'NF', 'NF'],
                       'col5': [1973, 'political_rights', 7, 7, 6, 4],
                       'col6': [1973, 'civil_liberties', 6, 7, 6, 4],
                       'col7': [1973, 'status', 'NF', 'NF', 'NF', 'PF']})

Simplified version of desired dataframe (future df: 10250 rows × 5 columns):

df = pd.DataFrame({'country': ['Afghanistan', 'Albania', 'Algeria', 'Afghanistan',  'Albania', 'Algeria'],
                   'years': [1972, 1972, 1972, 1973, 1973, 1973], 
                   'political_rights': [4, 7, 6, 7, 7, 6],
                   'civil_liberties': [5, 7, 6, 6, 7, 6],
                   'status': ['PF', 'NF', 'NF', 'NF', 'NF', 'NF']})

Solution

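# Transpose so each original column (col1..col7) becomes a row
s = df_bad.T
# Use the col1 row ('years', 'country', then the country names) as column labels
s.columns = s.loc['col1']
# Drop the col1 row; index by year and by measure name (held in the 'country' column)
s = s.drop('col1').set_index(['years', 'country'])
# Stack the country names into the index and name the levels by what they hold
s = s.stack().rename_axis(['years', None, 'country'])
# Move the measure names (level 1) back out to columns and flatten the index
s = s.unstack(1).reset_index()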

Explained

Transpose the dataframe

          0                 1            2        3        4        5
col1  years           country  Afghanistan  Albania  Algeria  Andorra
col2   1972  political_rights            4        7        6        4
col3   1972   civil_liberties            5        7        6        3
col4   1972            status           PF       NF       NF       NF
col5   1973  political_rights            7        7        6        4
col6   1973   civil_liberties            6        7        6        4
col7   1973            status           NF       NF       NF       PF

Set the columns to the col1 values, then drop col1 and set the index to years and country

col1                   Afghanistan Albania Algeria Andorra
years country                                             
1972  political_rights           4       7       6       4
      civil_liberties            5       7       6       3
      status                    PF      NF      NF      NF
1973  political_rights           7       7       6       4
      civil_liberties            6       7       6       4
      status                    NF      NF      NF      PF

Stack the dataframe to reshape it into a MultiIndex series, then rename the axis

years                    country    
1972   political_rights  Afghanistan     4
                         Albania         7
                         Algeria         6
                         Andorra         4
       civil_liberties   Afghanistan     5
                         Albania         7
                         Algeria         6
                         Andorra         3
       status            Afghanistan    PF
                         Albania        NF
                         Algeria        NF
                         Andorra        NF
1973   political_rights  Afghanistan     7
                         Albania         7
                         Algeria         6
                         Andorra         4
       civil_liberties   Afghanistan     6
                         Albania         7
                         Algeria         6
                         Andorra         4
       status            Afghanistan    NF
                         Albania        NF
                         Algeria        NF
                         Andorra        PF
dtype: object
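
The rename step is easy to overlook. After stacking, the level called country actually holds the measure names, and the level that holds the real country names is called col1. A minimal sketch that inspects the index names before and after the rename (the variable names stacked and renamed are only for illustration):

```python
import pandas as pd

df_bad = pd.DataFrame({'col1': ['years', 'country', 'Afghanistan', 'Albania', 'Algeria', 'Andorra'],
                       'col2': [1972, 'political_rights', 4, 7, 6, 4],
                       'col3': [1972, 'civil_liberties', 5, 7, 6, 3],
                       'col4': [1972, 'status', 'PF', 'NF', 'NF', 'NF'],
                       'col5': [1973, 'political_rights', 7, 7, 6, 4],
                       'col6': [1973, 'civil_liberties', 6, 7, 6, 4],
                       'col7': [1973, 'status', 'NF', 'NF', 'NF', 'PF']})

# Same first steps as in the solution
s = df_bad.T
s.columns = s.loc['col1']
s = s.drop('col1').set_index(['years', 'country'])

# Stack without renaming: the level names are inherited, so they are misleading
stacked = s.stack()
print(list(stacked.index.names))

# Rename: the middle level (measure names) gets no name, the last one becomes 'country'
renamed = stacked.rename_axis(['years', None, 'country'])
print(list(renamed.index.names))
```

Leaving the middle level unnamed means the columns produced by the later unstack carry no leftover axis label.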

Unstack the series on level=1 to reshape it back to a dataframe, then reset the index

   years      country civil_liberties political_rights status
0   1972  Afghanistan               5                4     PF
1   1972      Albania               7                7     NF
2   1972      Algeria               6                6     NF
3   1972      Andorra               3                4     NF
4   1973  Afghanistan               6                7     NF
5   1973      Albania               7                7     NF
6   1973      Algeria               6                6     NF
7   1973      Andorra               4                4     PF
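
The result above still has object dtype in every column, and its columns are in a different order from the desired dataframe in the question. A possible tidy-up, as a sketch only: the name tidy is hypothetical, and the integer conversion assumes the score columns contain no missing or non-numeric entries, which may not hold for the full 207-row data.

```python
import pandas as pd

df_bad = pd.DataFrame({'col1': ['years', 'country', 'Afghanistan', 'Albania', 'Algeria', 'Andorra'],
                       'col2': [1972, 'political_rights', 4, 7, 6, 4],
                       'col3': [1972, 'civil_liberties', 5, 7, 6, 3],
                       'col4': [1972, 'status', 'PF', 'NF', 'NF', 'NF'],
                       'col5': [1973, 'political_rights', 7, 7, 6, 4],
                       'col6': [1973, 'civil_liberties', 6, 7, 6, 4],
                       'col7': [1973, 'status', 'NF', 'NF', 'NF', 'PF']})

# The reshaping steps from the solution, unchanged
s = df_bad.T
s.columns = s.loc['col1']
s = s.drop('col1').set_index(['years', 'country'])
s = s.stack().rename_axis(['years', None, 'country'])
s = s.unstack(1).reset_index()

# Reorder the columns to match the layout asked for in the question
tidy = s[['country', 'years', 'political_rights', 'civil_liberties', 'status']]

# Convert the numeric columns from object to int
# (assumption: no missing or placeholder values in these columns)
tidy = tidy.astype({'years': int, 'political_rights': int, 'civil_liberties': int})

print(tidy.dtypes)
print(tidy)
```

If the real data does contain placeholders for missing scores, pd.to_numeric with errors='coerce' may be a safer conversion than astype.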
