
Choose between two values and set the most frequent in a pandas dataframe

I've asked a question recently, but now I have a new problem. Here is my DataFrame:

import pandas as pd
df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})

    id  sex
0   1   0
1   1   0
2   1   0
3   1   1
4   2   0
5   2   0
6   2   0
7   3   1
8   3   1
9   3   0
10  4   1
11  4   1

Now I need to fix the sex value for ids that have mixed sex values. Each such id should get its most frequent value, so I want to get something like this:

    id  sex
0   1   0
1   1   0
2   1   0
3   1   0
4   2   0
5   2   0
6   2   0
7   3   1
8   3   1
9   3   1
10  4   1
11  4   1

After that, I want to keep only one id-sex pair per id:

   id  sex
0   1   0
1   2   0
2   3   1
3   4   1

Option 1
You can use groupby followed by value_counts and idxmax.

df = df.set_index('id').groupby(level=0).sex\
          .apply(lambda x: x.value_counts().idxmax()).reset_index()
df

   id  sex
0   1    0
1   2    0
2   3    1
3   4    1
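To see what the lambda in Option 1 does, here is a minimal sketch that runs it by hand on a single group (the sex values for id 1). The variable names are only for illustration.

```python
import pandas as pd

# The sex values for id 1 from the question's DataFrame.
group = pd.Series([0, 0, 0, 1], name="sex")

# value_counts() returns a Series whose index holds the distinct values
# and whose values hold how many times each one occurs.
counts = group.value_counts()

# idxmax() returns the index label with the largest count,
# i.e. the most frequent value in the group.
most_frequent = counts.idxmax()

print(int(counts.loc[0]), int(counts.loc[1]))  # 3 1
print(int(most_frequent))                      # 0
```

groupby(...).apply simply repeats this for every id and collects the results.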

Option 2
Similar to Option 1, but in two steps: transform first overwrites sex with the most frequent value in each group, then drop_duplicates keeps one row per id.

df.sex = df.groupby('id').sex.transform(lambda x: x.value_counts().idxmax())
df

    id  sex
0    1    0
1    1    0
2    1    0
3    1    0
4    2    0
5    2    0
6    2    0
7    3    1
8    3    1
9    3    1
10   4    1
11   4    1

df = df.drop_duplicates()
df

    id  sex
0    1    0
4    2    0
7    3    1
10   4    1
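If you need this in more than one place, the two steps can be wrapped in a small function. This is only a sketch: most_frequent_per_group is a hypothetical helper name, not a pandas function, and it adds reset_index(drop=True) so the index runs 0..3 as in the desired output of the question.

```python
import pandas as pd

def most_frequent_per_group(frame, key, column):
    """Hypothetical helper: one row per key with the most frequent value of column."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    out[column] = out.groupby(key)[column].transform(
        lambda s: s.value_counts().idxmax()
    )
    # After the transform every row of a group is identical, so dropping
    # duplicates keeps one row per key; reset_index renumbers them from 0.
    return out[[key, column]].drop_duplicates().reset_index(drop=True)

df = pd.DataFrame({'id':  [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'sex': [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1]})
result = most_frequent_per_group(df, 'id', 'sex')
print(result)
```

Working on a copy means the original df still contains the mixed values, which is handy if you want to compare before and after.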

Use groupby with value_counts, which sorts by count in descending order by default, so you only need the first index label, selected with [0]:

df = df.groupby('id')['sex'].apply(lambda x: x.value_counts().index[0]).reset_index()
print (df)
   id  sex
0   1    0
1   2    0
2   3    1
3   4    1

You may use np.bincount as well. It only accepts non-negative integers, which is the case for sex here.

In [179]: df.groupby('id')['sex'].apply(lambda x: np.argmax(np.bincount(x))).reset_index()
Out[179]:
   id  sex
0   1    0
1   2    0
2   3    1
3   4    1
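As a rough illustration of why argmax over bincount gives the most frequent value, here is the same computation on one group (the sex values for id 3). Note that np.argmax returns the first maximum, so with this approach a tie goes to the smaller value.

```python
import numpy as np

# The sex values for id 3 from the question (must be non-negative integers).
values = np.array([1, 1, 0])

# bincount(x)[k] is the number of times the integer k occurs in x,
# so this array reads "one 0, two 1s".
counts = np.bincount(values)

# argmax returns the position of the largest count, which is the value itself.
most_frequent = int(np.argmax(counts))

print(counts.tolist())   # [1, 2]
print(most_frequent)     # 1
```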

Timings

In [194]: df = pd.concat([df]*1000, ignore_index=True)

In [195]: df.shape
Out[195]: (12000, 2)

In [196]: %timeit df.groupby('id')['sex'].apply(lambda x: np.argmax(np.bincount(x))).reset_index()
100 loops, best of 3: 2.48 ms per loop

In [197]: %timeit df.groupby('id')['sex'].apply(lambda x: x.value_counts().index[0]).reset_index()
100 loops, best of 3: 4.55 ms per loop

In [198]: %timeit df.set_index('id').groupby(level=0).sex.apply(lambda x: x.value_counts().idxmax()).reset_index()
100 loops, best of 3: 6.71 ms per loop
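To rerun the comparison outside IPython, something like the following sketch should work. The function names are made up for this example, the repeat count of 20 is arbitrary, and the absolute numbers will depend on your machine and on the pandas and NumPy versions, so treat the timings above as relative only.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'id':  [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'sex': [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1]})
big = pd.concat([df] * 1000, ignore_index=True)  # 12000 rows, as above

def by_bincount(frame):
    return (frame.groupby('id')['sex']
                 .apply(lambda x: np.argmax(np.bincount(x))).reset_index())

def by_index0(frame):
    return (frame.groupby('id')['sex']
                 .apply(lambda x: x.value_counts().index[0]).reset_index())

def by_idxmax(frame):
    return (frame.set_index('id').groupby(level=0)['sex']
                 .apply(lambda x: x.value_counts().idxmax()).reset_index())

runs = 20
results = {}
for func in (by_bincount, by_index0, by_idxmax):
    seconds = timeit.timeit(lambda: func(big), number=runs)
    # Keep each result so the approaches can be checked against each other.
    results[func.__name__] = func(big)['sex'].tolist()
    print(f"{func.__name__}: {seconds / runs * 1000:.2f} ms per call")
```

Storing the results also lets you confirm that all three approaches agree before comparing their speed.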
