Efficient way in Pandas to count occurrences of Series of values by row
I have a large dataframe for which I want to count, by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility, let's assume the following simplified dataframe:
import pandas as pd
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
   A  B  C  D  E
0  3  4  1  1  4
1  2  3  2  1  4
2  1  2  3  2  4
3  0  1  4  2  4
How can I count the number of occurrences of specific values (given by a series with the same size) by row?
Again for simplicity, let's assume this values_series is given by the max of each row.
values_series = df.max(axis=1)
0    4
1    4
2    4
3    4
dtype: int64
The solution I came up with does not seem very pythonic (e.g. I'm using iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    # look up how often this row's target value occurs in the row
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there a more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:
(df.T == df.max(axis=1)).sum()
# result
0    2
1    1
2    1
3    2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
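As a rough check of the transposed comparison above, the sketch below rebuilds the sample frame from the question and compares the result against a plain loop over the rows. It assumes that values_series carries the same index labels as df, because the comparison lines the series up with the columns of df.T (which are the original row labels). The names transposed_counts and loop_counts are only illustrative.

```python
import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
        'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame(data)
values_series = df.max(axis=1)

# After transposing, the columns of df.T are the original row labels,
# so they match values_series.index and each column is compared
# against its own target value.
transposed_counts = (df.T == values_series).sum()

# Reference result from an explicit (slow) loop over the rows.
loop_counts = pd.Series(
    [(row == values_series.loc[idx]).sum() for idx, row in df.iterrows()],
    index=df.index,
)

print(transposed_counts.tolist())  # [2, 1, 1, 2]
print(transposed_counts.equals(loop_counts))
```

Any other series with the same index could be substituted for values_series here; the row maximum is just the stand-in used in the question.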
You can try:
df.eq(df.max(1),axis=0).sum(1)
Out[361]:
0    2
1    1
2    1
3    2
dtype: int64
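Since the real values come from an external function, here is a small sketch of the same eq call with an arbitrary series instead of the row maximum. The series targets is a hypothetical example, not from the original post; it deliberately includes a value (5) that does not occur in its row. With eq, such a row simply counts as 0, whereas the value_counts lookup in the question's loop would most likely raise a KeyError for it.

```python
import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
        'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame(data)

# Hypothetical per-row target values (an assumption for illustration);
# 5 never appears in row 1.
targets = pd.Series([4, 5, 3, 0], index=df.index)

# axis=0 aligns targets with the row index, so every row is compared
# element-wise against its own target value.
counts = df.eq(targets, axis=0).sum(axis=1)

print(counts.tolist())  # [2, 0, 1, 1]
```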
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
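The snippet above returns a bare numpy array. A minimal sketch of how it might be wrapped back into a Series is shown below, using the sample data from the question. One caveat worth stating: numpy compares purely by position and ignores index labels, so this assumes values_series is in the same row order as df.

```python
import numpy as np
import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
        'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame(data)
values_series = df.max(axis=1)

a = df.to_numpy()                      # shape (4, 5): one row per dataframe row
b = values_series.to_numpy()[:, None]  # shape (4, 1): one target value per row

# Broadcasting stretches b across the 5 columns of a, giving a (4, 5)
# boolean array; summing along axis=1 counts the matches in each row.
counts = pd.Series((a == b).sum(axis=1), index=df.index)

print(counts.tolist())  # [2, 1, 1, 2]
```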