
Count occurrences in each dataframe row, then create a column with the most frequent value

I am trying to compare the three floats in each row of a 500000x3 dataframe. I expect at least two of the three values to match, and I want to select the value that occurs most often, under the presumption that they are not all different. My current attempt with a toy example looks like this:

mydf
   a  b  c
0  1  1  2
1  3  3  3
2  1  3  3
3  4  5  4
3  4  5  5



mydft = mydf.transpose()
counts = []
for col in mydft:
    counts.append(mydft[col].value_counts())

I am then thinking of looping over counts and selecting the top value for each, but this is very slow and feels anti-pandas. I have also tried this:

truth = mydf['a'] == mydf['b']

with the intention of keeping the rows that evaluate to True and doing something to those that do not, but I have thousands of NaN values in the real data, and apparently NaN == NaN is False. Any suggestions?
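As the question observes, NaN never compares equal to itself (per IEEE 754), so an elementwise equality check silently treats a pair of missing values as a mismatch. A minimal illustration:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan])
s2 = pd.Series([1.0, np.nan])

print(np.nan == np.nan)     # False: NaN is unequal to everything, itself included
print((s1 == s2).tolist())  # [True, False]: the NaN pair does not match
print(s1.eq(s2).tolist())   # same elementwise comparison via pandas .eq
```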

We can use scipy.stats.mode:

from scipy import stats

value, count = stats.mode(df.values, axis=1)
value
Out[180]: 
array([[1],
       [3],
       [3],
       [4],
       [5]], dtype=int64)


count
Out[181]: 
array([[2],
       [3],
       [2],
       [2],
       [2]])

Then assign it back:

df['new']=value
df
Out[183]: 
   a  b  c  new
0  1  1  2    1
1  3  3  3    3
2  1  3  3    3
3  4  5  4    4
3  4  5  5    5
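Collecting the above into one self-contained snippet. One caveat worth hedging: the shape of SciPy's mode result varies across versions (older releases return an (n, 1) column, newer ones a flat array unless `keepdims=True`), so ravelling the result is a safe way to normalize it:

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'a': [1, 3, 1, 4, 4],
                   'b': [1, 3, 3, 5, 5],
                   'c': [2, 3, 3, 4, 5]})

# Row-wise mode; ravel() flattens whether the result is shaped (n, 1) or (n,)
value = np.asarray(stats.mode(df.to_numpy(), axis=1).mode).ravel()
df['new'] = value
print(df['new'].tolist())  # [1, 3, 3, 4, 5]
```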

Here's a fast approach I learned from @coldspeed:

dummies = pd.get_dummies(df.astype(str)).groupby(by=lambda x: x.split('_')[1], axis=1).sum()

df['new'] = dummies.idxmax(1)

   a  b  c new
0  1  1  2   1
1  3  3  3   3
2  1  3  3   3
3  4  5  4   4
3  4  5  5   5

Explanation:

We can get a one-hot encoding of the values present in each column using pd.get_dummies; since get_dummies won't take numbers, we first have to convert them to strings.

pd.get_dummies(df.astype(str))

   a_1  a_3  a_4  b_1  b_3  b_5  c_2  c_3  c_4  c_5
0    1    0    0    1    0    0    1    0    0    0
1    0    1    0    0    1    0    0    1    0    0
2    1    0    0    0    1    0    0    1    0    0
3    0    0    1    0    0    1    0    0    1    0
3    0    0    1    0    0    1    0    0    0    1

Now, grouping the indicator columns by the value part of their names and summing gives the value counts for each row:

   1  2  3  4  5
0  2  1  0  0  0
1  0  0  3  0  0
2  1  0  2  0  0
3  0  0  0  2  1
3  0  0  0  1  2

Calling idxmax(axis=1) on these counts returns the column name, which is the most frequently occurring value in each row:

0    1
1    3
2    3
3    4
3    5
dtype: object
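Putting the dummies approach together as one runnable sketch. Two small adjustments to the one-liner above: the `axis=1` form of groupby is deprecated in recent pandas, so this variant groups on the transpose instead (an equivalent reformulation, not the original author's code), and since idxmax returns string column labels, pd.to_numeric converts them back to numbers:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 1, 4, 4],
                   'b': [1, 3, 3, 5, 5],
                   'c': [2, 3, 3, 4, 5]})

# One-hot encode, then sum the indicator columns that share the same value suffix
dummies = pd.get_dummies(df.astype(str))
counts = dummies.T.groupby(lambda name: name.split('_')[1]).sum().T

# idxmax returns the column label (a string), so convert back to a number
df['new'] = pd.to_numeric(counts.idxmax(axis=1))
print(df['new'].tolist())  # [1, 3, 3, 4, 5]
```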

Edit:

If your dataframe contains strings, go for get_dummies, which will be faster than anything else; if it contains numbers, use scipy's mode or pandas' mode.
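For the numeric case, pandas' own row-wise mode is worth noting because it skips NaN by default, which sidesteps the NaN == NaN problem from the question (a sketch, assuming a purely numeric frame; on ties, column 0 holds the smallest mode):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 4.0],
                   'b': [1.0, 3.0, 5.0],
                   'c': [2.0, 3.0, 4.0]})

# mode(axis=1) ignores NaN; column 0 of the result is the first mode per row
df['new'] = df.mode(axis=1)[0]
print(df['new'].tolist())  # [1.0, 3.0, 4.0]
```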
