I've asked a question recently, but now I have a new problem. Here is my DataFrame:
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 2 0
5 2 0
6 2 0
7 3 1
8 3 1
9 3 0
10 4 1
11 4 1
Now I need to set the sex value for ids that have mixed sex values: it should be the most frequent value within each id. So I want to get something like this:
id sex
0 1 0
1 1 0
2 1 0
3 1 0
4 2 0
5 2 0
6 2 0
7 3 1
8 3 1
9 3 1
10 4 1
11 4 1
After that, I want to keep only one id-sex pair per id:
id sex
0 1 0
1 2 0
2 3 1
3 4 1
Option 1
You can use groupby followed by value_counts and idxmax.
df = df.set_index('id').groupby(level=0).sex\
.apply(lambda x: x.value_counts().idxmax()).reset_index()
df
id sex
0 1 0
1 2 0
2 3 1
3 4 1
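To see what the lambda in Option 1 does, it may help to run it on a single group. The following is only a small sketch on the id 1 rows of the question's data; the variable names are mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})

# Take the sex values of one group only (id 1 has values 0, 0, 0, 1)
group = df.loc[df['id'] == 1, 'sex']

# value_counts() returns a Series: index = distinct values, data = counts
counts = group.value_counts()

# idxmax() returns the index label that holds the largest count
most_frequent = counts.idxmax()

print(counts.loc[0], counts.loc[1])  # 3 1
print(most_frequent)                 # 0
```

groupby(...).apply(...) simply repeats this computation once per id and collects the results.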
Option 2
Similar to Option 1, but in two steps, using transform and drop_duplicates:
df.sex = df.groupby('id').sex.transform(lambda x: x.value_counts().idxmax())
df
id sex
0 1 0
1 1 0
2 1 0
3 1 0
4 2 0
5 2 0
6 2 0
7 3 1
8 3 1
9 3 1
10 4 1
11 4 1
df = df.drop_duplicates()
df
id sex
0 1 0
4 2 0
7 3 1
10 4 1
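The difference between the two options comes down to transform versus apply: transform returns one value per original row, while apply returns one value per group. Here is a rough sketch of that on the question's data. The helper name majority is made up for illustration, and the reset_index(drop=True) at the end is my addition, assuming you want the 0..3 index shown in the question rather than the original row labels 0, 4, 7, 10.

```python
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})

def majority(x):
    # Hypothetical helper: most frequent value in a Series
    return x.value_counts().idxmax()

# transform broadcasts the group result back to every row of the group
per_row = df.groupby('id')['sex'].transform(majority)

# apply keeps a single result per group
per_group = df.groupby('id')['sex'].apply(majority)

print(len(per_row), len(per_group))  # 12 4

# Overwrite sex, drop the repeated rows, then renumber the index
pairs = df.assign(sex=per_row).drop_duplicates().reset_index(drop=True)
print(pairs)
```

Using assign here leaves the original df untouched, which is handy if you still need the raw values afterwards.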
Use groupby with value_counts, which sorts by count in descending order by default, so the most frequent value is the first index entry, selected with index[0]:
df = df.groupby('id')['sex'].apply(lambda x: x.value_counts().index[0]).reset_index()
print (df)
id sex
0 1 0
1 2 0
2 3 1
3 4 1
You may use np.bincount as well.
In [179]: df.groupby('id')['sex'].apply(lambda x: np.argmax(np.bincount(x))).reset_index()
Out[179]:
id sex
0 1 0
1 2 0
2 3 1
3 4 1
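For anyone unfamiliar with np.bincount, here is a minimal sketch of what happens inside the lambda for one group (the id 3 values from the question). Note that np.bincount only accepts non-negative integers, so this approach fits a 0/1 encoded column like this one but would not work for string labels.

```python
import numpy as np

# sex values for id 3 in the question's data
values = np.array([1, 1, 0])

# counts[i] is the number of times the integer i occurs in values
counts = np.bincount(values)

# argmax gives the position of the largest count, i.e. the most frequent value
most_frequent = np.argmax(counts)

print(counts)         # [1 2]
print(most_frequent)  # 1
```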
Timings
In [194]: df = pd.concat([df]*1000, ignore_index=True)
In [195]: df.shape
Out[195]: (12000, 2)
In [196]: %timeit df.groupby('id')['sex'].apply(lambda x: np.argmax(np.bincount(x))).reset_index()
100 loops, best of 3: 2.48 ms per loop
In [197]: %timeit df.groupby('id')['sex'].apply(lambda x: x.value_counts().index[0]).reset_index()
100 loops, best of 3: 4.55 ms per loop
In [198]: %timeit df.set_index('id').groupby(level=0).sex.apply(lambda x: x.value_counts().idxmax()).reset_index()
100 loops, best of 3: 6.71 ms per loop
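Before comparing speed it is worth checking that the three approaches return the same answer on the enlarged frame. This is just a sanity-check sketch; the function names are made up here, and the timings above were not re-measured.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})

def by_bincount(frame):
    return frame.groupby('id')['sex'].apply(
        lambda x: np.argmax(np.bincount(x))).reset_index()

def by_value_counts(frame):
    return frame.groupby('id')['sex'].apply(
        lambda x: x.value_counts().index[0]).reset_index()

def by_idxmax(frame):
    return frame.set_index('id').groupby(level=0).sex.apply(
        lambda x: x.value_counts().idxmax()).reset_index()

# Same enlargement as in the timing session: 12 rows repeated 1000 times
big = pd.concat([df] * 1000, ignore_index=True)

results = [f(big) for f in (by_bincount, by_value_counts, by_idxmax)]

# Compare as plain lists so integer dtype differences do not matter
for res in results:
    print(res['id'].tolist(), res['sex'].tolist())
```

All three should print the ids 1 to 4 with sex values 0, 0, 1, 1, since no id in this data has a tie between the two values.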