Aggregate pandas dataframe with string entries

I have a dataframe of the following form:

import pandas as pd

df = pd.DataFrame({'Start':['47q2',None, None,'49q1',None,None],
              'Threshold':[None, '47q3', None,None, '49q2', None],
              'End':[None, None, '48q1',None, None, '50q2'],
              'Series':['S1','S1','S1','S2','S2','S2']})

    End Series Start Threshold
0  None     S1  47q2      None
1  None     S1  None      47q3
2  48q1     S1  None      None
3  None     S2  49q1      None
4  None     S2  None      49q2
5  50q2     S2  None      None

I want to reshape the dataframe so that it looks like this:

df_wanted = pd.DataFrame({'Start':['47q2','49q1'],
              'Threshold':['47q3','49q2'],
              'End':['48q1','50q2'],
              'Series':['S1','S2']})

    End Series Start Threshold
0  48q1     S1  47q2      47q3
1  50q2     S2  49q1      49q2

That is, I'd like each Series to take up just one row, with the start, threshold and end information in the other columns.

I tried using groupby and agg, but because the entries are strings I couldn't get it working. I'm unsure what sort of function could achieve this.

I am unsure if it makes any difference, but this dataframe is constructed from another one that has None entries, and in my real data the missing values show up as NaN (I don't know how to reproduce that in an example).

Option 1
Use groupby + first.

df.groupby('Series', as_index=False).first()

  Series   End Start Threshold
0     S1  48q1  47q2      47q3
1     S2  50q2  49q1      49q2
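
As a rough sketch of why this works: GroupBy.first picks, for each column, the first non-null entry within each group, so it does not matter that the values are strings. The snippet below also checks the NaN case raised in the question. The make_frame helper and the result names are made up for this illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def make_frame(missing):
    # Hypothetical helper: rebuilds the question's frame with a chosen missing marker.
    return pd.DataFrame({
        'Start': ['47q2', missing, missing, '49q1', missing, missing],
        'Threshold': [missing, '47q3', missing, missing, '49q2', missing],
        'End': [missing, missing, '48q1', missing, missing, '50q2'],
        'Series': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    })


df_none = make_frame(None)    # missing entries stored as None
df_nan = make_frame(np.nan)   # missing entries stored as NaN

# first() skips null entries column by column within each group.
res_none = df_none.groupby('Series', as_index=False).first()
res_nan = df_nan.groupby('Series', as_index=False).first()

print(res_none)
print(res_nan)
```

Both frames should collapse to one row per Series, so the None versus NaN distinction should not matter for this approach.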

Option 2
A slower solution using groupby + apply.

df.groupby('Series').apply(lambda x: x.bfill().ffill()).drop_duplicates()

    End Series Start Threshold
0  48q1     S1  47q2      47q3
3  50q2     S2  49q1      49q2

The apply logic fills the holes within each group, and the final drop_duplicates call drops the redundant rows.
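
To see what the fill step does, here is a small sketch that applies the same bfill + ffill to a single group by hand (the S1 rows from the question). The variable names are illustrative only; recent pandas versions may emit a FutureWarning when filling object columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Start': ['47q2', None, None, '49q1', None, None],
    'Threshold': [None, '47q3', None, None, '49q2', None],
    'End': [None, None, '48q1', None, None, '50q2'],
    'Series': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
})

# Take one group, as the lambda inside apply would receive it.
group = df[df['Series'] == 'S1']

# bfill pulls later values upward, ffill pushes earlier values downward,
# so every row of the group ends up carrying every known value.
filled = group.bfill().ffill()

# All rows are now identical, so drop_duplicates keeps only the first one.
deduped = filled.drop_duplicates()

print(filled)
print(deduped)
```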

set_index + stack

df.set_index('Series').stack().unstack().reset_index()
  Series   End Start Threshold
0     S1  48q1  47q2      47q3
1     S2  50q2  49q1      49q2
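
A sketch of the idea behind this one: stacking turns the frame into a long Series indexed by (Series, column name), the missing entries are dropped, and unstacking pivots the remaining values back into one row per Series. The one-liner above relies on stack discarding missing values by default; the explicit dropna() below is my own addition, on the assumption that this default may differ between pandas versions. Recent pandas versions may also emit a FutureWarning about stack.

```python
import pandas as pd

df = pd.DataFrame({
    'Start': ['47q2', None, None, '49q1', None, None],
    'Threshold': [None, '47q3', None, None, '49q2', None],
    'End': [None, None, '48q1', None, None, '50q2'],
    'Series': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
})

# Long format: one entry per (Series, column) pair that actually holds a value.
# dropna() is a no-op where stack already dropped the missing entries.
long = df.set_index('Series').stack().dropna()

# Each (Series, column) pair is now unique, so unstack can pivot it back to wide form.
wide = long.unstack().reset_index()

print(long)
print(wide)
```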
