I have a dataframe of the following form:
import pandas as pd

df = pd.DataFrame({'Start': ['47q2', None, None, '49q1', None, None],
                   'Threshold': [None, '47q3', None, None, '49q2', None],
                   'End': [None, None, '48q1', None, None, '50q2'],
                   'Series': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2']})
End Series Start Threshold
0 None S1 47q2 None
1 None S1 None 47q3
2 48q1 S1 None None
3 None S2 49q1 None
4 None S2 None 49q2
5 50q2 S2 None None
I want to reshape the dataframe into the following form:
df_wanted = pd.DataFrame({'Start':['47q2','49q1'],
'Threshold':['47q3','49q2'],
'End':['48q1','50q2'],
'Series':['S1','S2']})
End Series Start Threshold
0 48q1 S1 47q2 47q3
1 50q2 S2 49q1 49q2
That is, I'd like each Series to take up just one row, and have the information about start, end and threshold in the other columns.
I tried using groupby and agg, but because the values are strings I couldn't get it working. I'm unsure what sort of function could achieve this.
I am unsure whether it makes any difference, but this dataframe is constructed from another one that has None entries, and here those entries show up as NaN (I don't know how to reproduce that in an example).
Option 1
Use groupby + first, which returns the first non-null value of each column within each group.
df.groupby('Series', as_index=False).first()
Series End Start Threshold
0 S1 48q1 47q2 47q3
1 S2 50q2 49q1 49q2
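On the None versus NaN worry in the question: first skips missing values in general, so the result should be the same either way. Here is a minimal sketch of that, where df_nan is a made-up name and the NaN entries are only an assumed stand-in for what the real dataframe contains:

```python
import numpy as np
import pandas as pd

# Same layout as the question, but with NaN instead of None (an assumption
# about how the real dataframe stores its missing entries).
df_nan = pd.DataFrame({
    'Start': ['47q2', np.nan, np.nan, '49q1', np.nan, np.nan],
    'Threshold': [np.nan, '47q3', np.nan, np.nan, '49q2', np.nan],
    'End': [np.nan, np.nan, '48q1', np.nan, np.nan, '50q2'],
    'Series': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
})

# first() takes the first non-null entry of each column per group
result = df_nan.groupby('Series', as_index=False).first()
print(result)
```

Column order in the printed frame may differ from the output above depending on the pandas version, but each Series should end up on one row.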
Option 2
A slower solution using groupby + apply.
df.groupby('Series').apply(lambda x: x.bfill().ffill()).drop_duplicates()
End Series Start Threshold
0 48q1 S1 47q2 47q3
3 50q2 S2 49q1 49q2
The apply logic fills the holes within each group (bfill, then ffill), and the final drop_duplicates call drops the rows that have become redundant.
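To see why this works, it may help to look at a single group by hand. The sketch below pulls out the S1 rows with a boolean mask instead of groupby (my simplification, not part of the original one-liner) and uses the illustrative names group and filled:

```python
import pandas as pd

df = pd.DataFrame({
    'Start': ['47q2', None, None, '49q1', None, None],
    'Threshold': [None, '47q3', None, None, '49q2', None],
    'End': [None, None, '48q1', None, None, '50q2'],
    'Series': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
})

# Take one group on its own to inspect the fill step
group = df[df['Series'] == 'S1']

# bfill pulls later values upward, ffill pushes earlier values downward,
# so every row of the group ends up holding the same complete record
filled = group.bfill().ffill()
print(filled)

# The rows are now identical, so drop_duplicates keeps only the first one
deduped = filled.drop_duplicates()
print(deduped)
```

Filling inside each group matters: a plain df.bfill().ffill() on the whole frame would let values leak from S2 into S1.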
Use set_index + stack + unstack:
df.set_index('Series').stack().unstack().reset_index()
Series End Start Threshold
0 S1 48q1 47q2 47q3
1 S2 50q2 49q1 49q2
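This approach relies on the missing entries being gone before unstack runs, otherwise each (Series, column) pair appears more than once and the reshape cannot proceed. As far as I know, whether stack drops them on its own depends on the pandas version, so the sketch below adds an explicit dropna() as a precaution. That extra call, and the names stacked and wide, are my additions rather than part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'Start': ['47q2', None, None, '49q1', None, None],
    'Threshold': [None, '47q3', None, None, '49q2', None],
    'End': [None, None, '48q1', None, None, '50q2'],
    'Series': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
})

# Long format: one entry per (Series, column) pair; dropna() removes any
# missing entries that stack itself left in place
stacked = df.set_index('Series').stack().dropna()

# Pivot the inner index level back into columns: one row per Series
wide = stacked.unstack().reset_index()
print(wide)
```

After dropna(), each (Series, column) pair occurs exactly once in this data, which is what unstack needs.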