Finding most frequent value from dataframe rows in Pandas
In a DataFrame, I want to create another column that outputs the most frequent value across the other columns in each row.
A B C D
foo bar baz foo
egg bacon egg egg
bacon egg foo baz
The "E" column must contain the most frequent value in each row, like this:
E
foo
egg
How can I do it in Python?
Recreating your problem with:
import pandas as pd

df = pd.DataFrame(
{
'A' : ['foo', 'egg', 'bacon'],
'B' : ['bar', 'bacon', 'egg'],
'C' : ['baz', 'egg', 'foo'],
'D' : ['foo', 'egg', 'baz']
}
)
And solving the problem with:
df['E'] = df.mode(axis=1)[0]
Output:
A B C D E
0 foo bar baz foo foo
1 egg bacon egg egg egg
2 bacon egg foo baz bacon
What happens if there is no single most frequent element?
df.mode(axis=1)
0 1 2 3
0 foo NaN NaN NaN
1 egg NaN NaN NaN
2 bacon baz egg foo
As you can see, when several values tie for most frequent, mode returns every value in the tied set, one per column, and pads the other rows with NaN. If I replace foo with egg in column C and baz with bacon in column D, we get the following result:
0 1
0 foo NaN
1 egg NaN
2 bacon egg
Now the result for the last row contains only two elements, which means that the tie is between bacon and egg.
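The swapped case above can be reproduced with a small self-contained sketch. The name df_swapped is only an illustration and does not appear in the original answer; the values are the original data with the two replacements applied.

```python
import pandas as pd

# Original data with foo -> egg in column C and baz -> bacon in column D
# (df_swapped is a hypothetical name used only for this sketch).
df_swapped = pd.DataFrame(
    {
        'A': ['foo', 'egg', 'bacon'],
        'B': ['bar', 'bacon', 'egg'],
        'C': ['baz', 'egg', 'egg'],
        'D': ['foo', 'egg', 'bacon'],
    }
)

# One column per tied value; rows with a single mode are padded with NaN.
modes = df_swapped.mode(axis=1)
print(modes)
```

The last row now holds bacon twice and egg twice, so both values are returned for it.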
How do I detect ties?
Let us work with the dataset without column D.
df
A B C
0 foo bar baz
1 egg bacon egg
2 bacon egg foo
df_m = df.mode(axis=1)
df_m
0 1 2
0 bar baz foo
1 egg NaN NaN
2 bacon egg foo
df['D'] = df_m[0]
A B C D
0 foo bar baz bar
1 egg bacon egg egg
2 bacon egg foo bacon
We can use the notna() method that pandas provides to build a mask of the rows that have a non-NaN value beyond the first mode column, i.e. the rows that are in a tie.
First, we must drop the first column, which always has a value.
df_m = df_m.drop(columns=0)
Then we transpose the DataFrame with .T and check, for each original row, whether any of the remaining values is not NaN.
df_mask = df_m.T.notna().any()
df_mask
0 True
1 False
2 True
dtype: bool
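As a side note, the transpose is not strictly required. A minimal sketch, assuming the same three-column data as above, builds the same mask by reducing along axis=1; the names mask_transposed and mask_rowwise are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'A': ['foo', 'egg', 'bacon'],
        'B': ['bar', 'bacon', 'egg'],
        'C': ['baz', 'egg', 'foo'],
    }
)

# Row-wise modes without the first column, which always holds a value.
df_m = df.mode(axis=1).drop(columns=0)

# Approach from the answer: transpose, then reduce over what were the rows.
mask_transposed = df_m.T.notna().any()

# Alternative: reduce across the columns directly.
mask_rowwise = df_m.notna().any(axis=1)

# Rows 0 and 2 have more than one mode, so they are flagged as ties.
print(mask_transposed.tolist())
print(mask_rowwise.tolist())
```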
Now we have a pandas Series of booleans. We can use this mask to overwrite the column from before, here falling back to the value in column A for the tied rows.
df.loc[df_mask, 'D'] = df.loc[df_mask, 'A']
A B C D
0 foo bar baz foo
1 egg bacon egg egg
2 bacon egg foo bacon
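The steps above can be gathered into one small helper. This is only a sketch: row_mode_with_fallback and its fallback_col parameter are hypothetical names, and falling back to column A on a tie is the choice made in this answer, not a pandas default.

```python
import pandas as pd

def row_mode_with_fallback(frame, fallback_col):
    """Most frequent value per row; on a tie, use the value in fallback_col."""
    modes = frame.mode(axis=1)
    # A row is tied if any mode column after the first holds a value.
    tied = modes.iloc[:, 1:].notna().any(axis=1)
    # Keep the first mode where there is no tie, otherwise take the fallback.
    return modes[0].where(~tied, frame[fallback_col])

df = pd.DataFrame(
    {
        'A': ['foo', 'egg', 'bacon'],
        'B': ['bar', 'bacon', 'egg'],
        'C': ['baz', 'egg', 'foo'],
    }
)
df['D'] = row_mode_with_fallback(df[['A', 'B', 'C']], 'A')
print(df)
```

Computing the result first and assigning it as a whole column avoids modifying part of an existing column in place.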