I am trying to compare the three floats in each row of a dataframe of shape (500000, 3). I expect all three values to be the same, or at least two of them. I want to select the value that occurs most often, on the presumption that they are not all different. My current attempt, with a toy example, looks like this:
mydf
a b c
0 1 1 2
1 3 3 3
2 1 3 3
3 4 5 4
3 4 5 5
mydft = mydf.transpose()
counts=[]
for col in mydft:
counts.append(mydft[col].value_counts())
I was then thinking of looping over counts and selecting the top value for each, but this is very slow and feels anti-pandas. I have also tried this:
truth = mydf['a'] == mydf['b']
with the intention of keeping the rows that evaluate to True and doing something else with those that do not, but I have thousands of NaN values in the real data, and apparently NaN == NaN is False. Any suggestions?
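To make the NaN problem concrete, here is a minimal sketch of the comparison on a hypothetical two-column frame (the name demo and its values are made up for illustration). It shows that plain equality drops rows where both values are missing, and one possible way to treat two NaNs as equal.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has NaN in both columns, row 2 genuinely differs.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1.0, np.nan, 4.0]})

# Plain equality: NaN never compares equal, not even to another NaN.
plain = demo["a"] == demo["b"]

# NaN-aware equality: also accept rows where both values are missing.
nan_aware = plain | (demo["a"].isna() & demo["b"].isna())

print(plain.tolist())      # [True, False, False]
print(nan_aware.tolist())  # [True, True, False]
```

Whether two missing values should count as "the same" depends on the data, so treat the second mask as one option rather than the rule.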
We can use scipy.stats.mode along each row (axis=1).
...
from scipy import stats
value,count=stats.mode(df.values,axis=1)
value
Out[180]:
array([[1],
[3],
[3],
[4],
[5]], dtype=int64)
count
Out[181]:
array([[2],
[3],
[2],
[2],
[2]])
Then assign it back:
df['new']=value
df
Out[183]:
a b c new
0 1 1 2 1
1 3 3 3 3
2 1 3 3 3
3 4 5 4 4
3 4 5 5 5
Here's a fast approach I learnt from @coldspeed:
dummies = pd.get_dummies(df.astype(str)).groupby(by=lambda x: x.split('_')[1], axis=1).sum()
df['new'] = dummies.idxmax(1)
a b c new
0 1 1 2 1
1 3 3 3 3
2 1 3 3 3
3 4 5 4 4
3 4 5 5 5
Explanation:
We can get the one-hot encoding of the items present in each column using pd.get_dummies. Since get_dummies leaves numeric columns unencoded by default, we first have to convert them to strings.
pd.get_dummies(df.astype(str))
a_1 a_3 a_4 b_1 b_3 b_5 c_2 c_3 c_4 c_5
0 1 0 0 1 0 0 1 0 0 0
1 0 1 0 0 1 0 0 1 0 0
2 1 0 0 0 1 0 0 1 0 0
3 0 0 1 0 0 1 0 0 1 0
3 0 0 1 0 0 1 0 0 0 1
Now, if you group the columns by the number after the underscore and sum them, you get the value counts for each row, i.e.
1 2 3 4 5
0 2 1 0 0 0
1 0 0 3 0 0
2 1 0 2 0 0
3 0 0 0 2 1
3 0 0 0 1 2
Using idxmax(axis=1) on these grouped counts gives, for each row, the column name with the highest count, which is the most frequent number in that row.
0 1
1 3
2 3
3 4
3 5
dtype: object
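The three steps above can be sketched end to end as follows. As far as I know, newer pandas releases deprecate groupby(..., axis=1), so this sketch groups the transposed frame instead. That substitution is my assumption, not part of the original answer, and the variable names are illustrative.

```python
import pandas as pd

# Toy data from the question (default integer index used here).
df = pd.DataFrame({"a": [1, 3, 1, 4, 4],
                   "b": [1, 3, 3, 5, 5],
                   "c": [2, 3, 3, 4, 5]})

# Step 1: one-hot encode; columns are named '<column>_<value>', e.g. 'a_1'.
onehot = pd.get_dummies(df.astype(str))

# Step 2: sum the indicator columns that share the same value suffix.
# Transposing lets us group on the row index instead of using axis=1.
counts = (onehot.astype(int)
                .T.groupby(lambda name: name.split("_")[1])
                .sum()
                .T)

# Step 3: the column label with the highest count is the row's most frequent value.
most_common = counts.idxmax(axis=1)

print(most_common.tolist())  # ['1', '3', '3', '4', '5']
```

Note that the result holds strings, so it needs converting back (for example with astype(float)) if numbers are required, and that split('_') assumes neither the column names nor the values contain an underscore.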
Edit:
If you have strings in your dataframe, go for get_dummies, which would be faster than anything else. If you have numbers, go for scipy mode or pandas mode.
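For the pandas mode option, a minimal sketch on the toy data might look like this. DataFrame.mode(axis=1) returns one column per tied mode, so taking column 0 picks the first (smallest) mode in each row. By default it ignores NaN (dropna=True). I have not timed it on 500000 rows, so its speed relative to scipy is an open question.

```python
import pandas as pd

# Toy data from the question (default integer index used here).
df = pd.DataFrame({"a": [1, 3, 1, 4, 4],
                   "b": [1, 3, 3, 5, 5],
                   "c": [2, 3, 3, 4, 5]})

# Row-wise mode; column 0 holds the first mode of each row.
row_mode = df.mode(axis=1)[0]

# Assign to a new column without touching a, b, c.
result = df.assign(new=row_mode)

print(result["new"].astype(int).tolist())  # [1, 3, 3, 4, 5]
```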