Apply function to each row in a dataframe
I am trying to apply the following function to each row in a dataframe. The dataframe looks as follows:
vote_1 vote_2 vote_3 vote_4
a a a b
b b a b
b a a b
I am trying to generate a new column that tallies the 'votes' in the other columns and reports the winner, as follows:
vote_1 vote_2 vote_3 vote_4 winner_columns
a a a b a
b b a b b
b a a b draw
I have currently tried:
def winner(x):
    a = new_df.iloc[x].value_counts()['a']
    b = new_df.iloc[x].value_counts()['b']
    if a > b:
        y = 'a'
    elif a < b:
        y = 'b'
    else:
        y = 'draw'
    return y

df['winner_columns'].apply(winner)
However, the whole column gets filled with draws. I assume it has something to do with the way I have built the function, but I can't figure out what.
You can use DataFrame.mode and count the non-missing values with DataFrame.count. If a row has only one mode, take the first column; otherwise use 'draw', via numpy.where:
df1 = df.mode(axis=1)
print (df1)
0 1
0 a NaN
1 b NaN
2 a b
df['winner_columns'] = np.where(df1.count(axis=1).eq(1), df1[0], 'draw')
print (df)
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
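For reference, the mode-based approach can be run end to end as below. This is a minimal sketch that rebuilds the sample data from the question; the variable name `modes` is only illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'vote_1': ['a', 'b', 'b'],
    'vote_2': ['a', 'b', 'a'],
    'vote_3': ['a', 'a', 'a'],
    'vote_4': ['b', 'b', 'b'],
})

# One row of modes per input row; a tied row has more than one non-missing value
modes = df.mode(axis=1)

# Exactly one mode means a clear winner; anything else is treated as a draw
df['winner_columns'] = np.where(modes.count(axis=1).eq(1), modes[0], 'draw')
print(df)
```

Note that this treats any tie for the most frequent value as a draw, which matches the a/b comparison only as long as the votes contain just those two values.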
Your solution can be changed as follows:
def winner(x):
    s = x.value_counts()
    a = s['a']
    b = s['b']
    if a > b:
        y = 'a'
    elif a < b:
        y = 'b'
    else:
        y = 'draw'
    return y

df['winner_columns'] = df.apply(winner, axis=1)
print (df)
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
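Why `axis=1` matters may be easier to see in a small sketch. The frame below is a cut-down, assumed version of the sample data: `Series.apply` hands the function one cell value at a time, while `DataFrame.apply(..., axis=1)` hands it a whole row as a Series, which is what `value_counts` needs.

```python
import pandas as pd

# Cut-down, illustrative version of the vote data
df = pd.DataFrame({
    'vote_1': ['a', 'b'],
    'vote_2': ['a', 'a'],
    'vote_3': ['b', 'a'],
})

# Series.apply: the function receives each cell value (a str here)
cell_types = df['vote_1'].apply(lambda x: type(x).__name__)

# DataFrame.apply with axis=1: the function receives each row as a Series
row_types = df.apply(lambda row: type(row).__name__, axis=1)

print(cell_types.tolist())
print(row_types.tolist())
```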
The first problem is that your DataFrame sometimes contains a letter followed by a dot.
So to look for solely 'a' or 'b', you have to replace these dots with an empty string, something like:
df.replace(r'\.', '', regex=True)
Another problem, which didn't surface in your case, is that a row can contain only 'a' or only 'b', and your code should be robust to the absence of a particular value in such a row.
To make your function robust to such cases, change it to:
def winner(row):
    vc = row.value_counts()
    a = vc.get('a', 0)
    b = vc.get('b', 0)
    if a > b: return 'a'
    elif a < b: return 'b'
    else: return 'draw'
Then you can apply your function, but if you want to apply it to each row (not each column), you should pass axis=1.
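As a quick check of the unanimous-row case, here is a minimal sketch on hypothetical data (not the question's sample) whose first row contains only 'a'. With plain indexing such as `vc['b']`, that row would raise a KeyError; `vc.get('b', 0)` returns 0 instead.

```python
import pandas as pd

def winner(row):
    vc = row.value_counts()
    a = vc.get('a', 0)  # 0 when the row has no 'a' votes
    b = vc.get('b', 0)  # 0 when the row has no 'b' votes
    if a > b:
        return 'a'
    elif a < b:
        return 'b'
    return 'draw'

# Hypothetical data: the first row is unanimous
df = pd.DataFrame({
    'vote_1': ['a', 'b', 'b'],
    'vote_2': ['a', 'a', 'b'],
    'vote_3': ['a', 'a', 'b'],
    'vote_4': ['a', 'b', 'a'],
})

# axis=1 passes each row (not each column) to winner
result = df.apply(winner, axis=1)
print(result.tolist())
```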
So, to sum up, change your code to:
df['winner_columns'] = df.replace(r'\.', '', regex=True).apply(winner, axis=1)
The result, for your sample data, is:
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a. a. a. b a
1 b. b. a b b
2 b. a. a b draw
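The vote columns in this output still show the dots because `replace` returns a new DataFrame, and only that cleaned copy is passed to `apply`. A minimal sketch of just the cleaning step, with the data reconstructed from the table above and `cleaned` as an illustrative name:

```python
import pandas as pd

# Data reconstructed from the output table above
df = pd.DataFrame({
    'vote_1': ['a.', 'b.', 'b.'],
    'vote_2': ['a.', 'b.', 'a.'],
    'vote_3': ['a.', 'a', 'a'],
    'vote_4': ['b', 'b', 'b'],
})

# regex=True makes the pattern match inside each string;
# the result is a new DataFrame and df itself is left unchanged
cleaned = df.replace(r'\.', '', regex=True)

print(cleaned)
```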
You can use .sum() to count the votes, collect the winners in a list, and finally add that list to the dataframe.
numpy_votes = dataframe_votes.to_numpy()
winner_columns = []
for i in numpy_votes:
    if np.sum(i == 'a') < np.sum(i == 'b'):
        winner_columns.append('b')
    elif np.sum(i == 'a') > np.sum(i == 'b'):
        winner_columns.append('a')
    else:
        winner_columns.append('draw')
dataframe_votes['winner_columns'] = winner_columns
According to this answer, using the .sum() method is the fastest way to count elements inside arrays.
Output:
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
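The same counting idea can also be written without the explicit Python loop. The following is a variant sketch rather than the code from the answer above: it sums the boolean comparisons along each row and picks the label with `np.select`. The names `votes`, `a_counts`, and `b_counts` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
dataframe_votes = pd.DataFrame({
    'vote_1': ['a', 'b', 'b'],
    'vote_2': ['a', 'b', 'a'],
    'vote_3': ['a', 'a', 'a'],
    'vote_4': ['b', 'b', 'b'],
})

votes = dataframe_votes.to_numpy()

# Summing a boolean array along axis=1 counts the matches in each row
a_counts = (votes == 'a').sum(axis=1)
b_counts = (votes == 'b').sum(axis=1)

# First matching condition wins; rows matching neither are draws
dataframe_votes['winner_columns'] = np.select(
    [a_counts > b_counts, a_counts < b_counts],
    ['a', 'b'],
    default='draw',
)
print(dataframe_votes)
```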