
Efficient way in Pandas to count occurrences of a Series of values by row

I have a large dataframe for which I want to count, by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility, let's assume the following simplified dataframe:

import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
   A  B  C  D  E
0  3  4  1  1  4
1  2  3  2  1  4
2  1  2  3  2  4
3  0  1  4  2  4

How can I count the number of occurrences of specific values (given by a series with the same length as the dataframe) by row?

Again for simplicity, let's assume this values_series is given by the max of each row.

values_series = df.max(axis=1)
0    4
1    4
2    4
3    4
dtype: int64

The solution I arrived at does not seem very pythonic (e.g. I'm using iterrows(), which is slow):

max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)

Is there a more efficient way to do this?

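We can compare the transposed df.T directly to the df.max series. Comparing a DataFrame with a Series aligns the Series index with the DataFrame's columns and broadcasts across the rows, and the columns of df.T are the original row labels: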

(df.T == df.max(axis=1)).sum()

# result
0    2
1    1
2    1
3    2
dtype: int64

(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
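As a quick sanity check, the transposed comparison can be run next to the loop from the question on the sample data. This is only a sketch: the names loop_counts and transposed_counts are introduced here for illustration, and values_series is used in place of df.max(axis=1) to show that any series indexed like the rows should work the same way.

```python
import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
        'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
values_series = df.max(axis=1)

# Loop from the question, kept as the reference result.
loop_counts = pd.Series(
    [row.value_counts()[values_series.loc[index]] for index, row in df.iterrows()],
    index=df.index,
)

# Transposed comparison: the columns of df.T are the original row labels,
# so they line up with the index of values_series.
transposed_counts = (df.T == values_series).sum()

print(loop_counts.tolist())        # [2, 1, 1, 2]
print(transposed_counts.tolist())  # [2, 1, 1, 2]
```

Note that the transpose copies the data when the columns have mixed dtypes, so on a very large frame the axis-based variants may be worth timing as well.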

You can try:

df.eq(df.max(axis=1), axis=0).sum(axis=1)

# result
0    2
1    1
2    1
3    2
dtype: int64
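Since the question says the values come from an external function rather than from the row maximum, here is a minimal sketch of the same eq call with an arbitrary series. The name target_values and its contents are made up for illustration; the only assumption is that it shares the dataframe's index.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
                   'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]})

# Hypothetical stand-in for the values returned by the external function;
# 9 does not occur anywhere in the last row.
target_values = pd.Series([1, 2, 3, 9], index=df.index)

# axis=0 matches target_values against the row labels, so each row is
# compared with its own target value.
counts = df.eq(target_values, axis=0).sum(axis=1)
print(counts.tolist())  # [2, 2, 1, 0]
```

A value that is absent from its row simply yields a count of 0 here, whereas row.value_counts()[value] in the original loop would raise a KeyError in that case.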

A perfect job for numpy broadcasting:

a = df.to_numpy()
b = values_series.to_numpy()[:, None]

(a == b).sum(axis=1)
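The numpy expression returns a plain array, so the row labels are lost. One possible way to keep them is to wrap the result back into a Series, as sketched below. The helper name count_row_matches is hypothetical, and it assumes the values are given in the same row order as the dataframe, because numpy compares by position rather than by label.

```python
import numpy as np
import pandas as pd

def count_row_matches(frame, values):
    """Hypothetical helper: count, per row, the cells equal to that row's value."""
    a = frame.to_numpy()
    b = np.asarray(values)[:, None]  # shape (n_rows, 1), broadcasts across columns
    # Positional comparison: values[i] is compared with every cell of row i.
    return pd.Series((a == b).sum(axis=1), index=frame.index)

df = pd.DataFrame(
    {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
     'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]},
    index=['w', 'x', 'y', 'z'],  # non-default labels, to show they are preserved
)

row_counts = count_row_matches(df, df.max(axis=1))
print(row_counts.to_dict())  # {'w': 2, 'x': 1, 'y': 1, 'z': 2}
```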
