I have a large dataset with messy data. The data looks like this:
import numpy as np
import pandas as pd

# np.nan marks the empty cells in 'Task'
df1 = pd.DataFrame({'Batch':[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'Case':[1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2],
'Live':['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No'],
'Task':['Download', np.nan, 'Download', 'Report', 'Report', np.nan, 'Download', np.nan, np.nan, np.nan, 'Download', 'Download', 'Report', np.nan, 'Report']
})
For the purpose of the example, each NaN stands for an empty cell (a missing value), not the string 'nan'.
I need to group by 'Batch', then by 'Case', keep only the rows where 'Live' has the value 'Yes', and then fill 'Task' downwards within each group.
I essentially want it to look something like this
My current approach has been:
df['Task'] = df.groupby(['Batch','Case'])['Live'].filter(lambda x: x == 'Yes')['Task'].fillna(method='ffill')
I've tried a number of variations, but I keep getting errors like "the filter must return a boolean result"
Does anyone know how I can go about doing this?
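You do not need filter here. GroupBy.filter expects a function that returns a single True or False per group and then keeps or drops the whole group, which is why a lambda returning a boolean Series raises "the filter must return a boolean result". Instead, select the rows where Live is 'Yes' before the groupby: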
df1['Task'] = df1.loc[df1['Live'] == 'Yes'].groupby(['Batch', 'Case'])['Task'].ffill()
df1
Out[620]:
    Batch  Case Live      Task
0       1     1  Yes  Download
1       1     1  Yes  Download
2       1     1   No       NaN
3       1     2  Yes    Report
4       1     2   No       NaN
5       1     2   No       NaN
6       1     2  Yes  Download
7       1     2  Yes  Download
8       1     2  Yes  Download
9       2     1  Yes       NaN
10      2     1  Yes  Download
11      2     1   No       NaN
12      2     2  Yes    Report
13      2     2  Yes    Report
14      2     2   No       NaN
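Note that in the output above the rows where Live is 'No' end up with NaN in Task, even where they originally had a value (rows 2, 4, 11 and 14). The filled result only contains the 'Yes' rows, and assigning it to the whole column aligns on the index, so every other row becomes missing. The question does not say whether that is wanted. If the 'No' rows should keep their original Task, one possible variant is to assign back only into the 'Yes' rows. This is a sketch under that assumption; the names live and filled are just illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Batch': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'Case': [1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    'Live': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No'],
    'Task': ['Download', np.nan, 'Download', 'Report', 'Report', np.nan,
             'Download', np.nan, np.nan, np.nan, 'Download', 'Download',
             'Report', np.nan, 'Report'],
})

# Boolean mask for the rows to work on (illustrative name)
live = df1['Live'] == 'Yes'

# Forward-fill Task within each (Batch, Case) group, using 'Yes' rows only
filled = df1.loc[live].groupby(['Batch', 'Case'])['Task'].ffill()

# Write back into the 'Yes' rows only; 'No' rows keep their original Task
df1.loc[live, 'Task'] = filled

print(df1)
```

Row 9 stays NaN in both versions, because it is the first 'Yes' row of its group and there is nothing above it to fill from.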