Suppose I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'id':['A','A', 'A', 'B','B'], 'value':[2, 4, 6, 3, 4]})
I want to filter only the rows with id = A, keeping x percent of them, and leave all other rows untouched.
For example, if x = 60%, the dataframe should look like this:
  id  value
0  A      2
1  A      4
3  B      3
4  B      4
How can I do this efficiently in pandas?
To clarify: the rows with id = A are not necessarily next to each other.
One way is to use iloc[] with pd.concat:
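x = 0.6
cond = df['id'].eq('A')
# number of 'A' rows to keep: x of them, rounded to the nearest integer
n_keep = int(round(cond.sum() * x))
# first n_keep 'A' rows plus every other row, put back in the original order
out = pd.concat((df[cond].iloc[:n_keep], df[~cond]), sort=False).sort_index()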
id value
0 A 2
1 A 4
3 B 3
4 B 4
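Since the question mentions that the id = A rows may be scattered, here is a rough sketch of the same "keep the first x percent" idea written as a single boolean mask, without pd.concat. The function name keep_leading_fraction and the frame scattered are made up for illustration; they are not part of pandas or of the original post.

```python
import pandas as pd

def keep_leading_fraction(df, column, key, frac):
    """Keep the first `frac` of the rows where df[column] == key, plus all other rows."""
    cond = df[column].eq(key)
    # how many matching rows to keep, rounded to the nearest integer
    n_keep = int(round(cond.sum() * frac))
    # cond.cumsum() counts the matching rows seen so far, so a matching row
    # is kept only while that running count is still <= n_keep
    keep = ~cond | (cond.cumsum() <= n_keep)
    return df[keep]

# hypothetical frame where the 'A' rows are not contiguous
scattered = pd.DataFrame({'id': ['A', 'B', 'A', 'A', 'B'],
                          'value': [2, 3, 4, 6, 4]})
result = keep_leading_fraction(scattered, 'id', 'A', 0.6)
print(result)
```

Because it only filters with a mask, the original row order and index labels are preserved and no sort_index() is needed.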
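You can use df.sample to achieve that easily. Note that it picks rows at random, so which A rows are kept (and their order) can differ between runs: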
ids = ['A']
frac = 0.6
df.groupby('id', group_keys=False).apply(lambda x: x.sample(frac=frac)
if x.name in ids else x)
Out:
id value
1 A 4
0 A 2
3 B 3
4 B 4
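If you want the random pick to be reproducible, or would rather not depend on x.name inside groupby.apply, a possible variant is to sample only the selected ids and then put the untouched rows back. This is a sketch that assumes pandas 1.1 or later (where DataFrameGroupBy.sample is available); the random_state value is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B'],
                   'value': [2, 4, 6, 3, 4]})
ids = ['A']
frac = 0.6

# rows whose id is in the list of ids to be downsampled
in_ids = df['id'].isin(ids)

# sample a fraction of each selected id; random_state makes the pick repeatable
sampled = df[in_ids].groupby('id').sample(frac=frac, random_state=0)

# add the untouched rows back and restore the original row order
out = pd.concat([sampled, df[~in_ids]]).sort_index()
print(out)
```

With frac = 0.6 and three A rows, two of them are kept; the B rows are always kept in full.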