
How to count the number of occurrences in either of two columns

I have a simple-looking problem. I have a dataframe df with two columns. For each of the strings that occurs in either of these columns, I would like to count the number of rows that contain that string in either column.

E.g.

g k
a h
c i
j e
d i
i h
b b
d d
i a
d h

The following code works but is very inefficient.

for elem in set(df.values.flat):
    print(elem, len(df.loc[(df[0] == elem) | (df[1] == elem)]))


a 2
c 1
b 1
e 1
d 3
g 1
i 4
h 3
k 1
j 1

This is, however, very inefficient, and my dataframe is large. The inefficiency comes from calling df.loc[(df[0] == elem) | (df[1] == elem)] separately for every distinct symbol in df.

Is there a fast way of doing this?

You can use loc to filter out row-level matches from 'col2' (rows where 'col2' equals 'col1', which would otherwise be double-counted), append the filtered 'col2' values to 'col1', and then call value_counts:

counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()

The resulting output:

i    4
d    3
h    3
a    2
j    1
k    1
c    1
g    1
b    1
e    1

Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
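As a side note, Series.append was removed in pandas 2.0, so on current pandas the same counting can be written with pd.concat. A minimal self-contained sketch, assuming the column names 'col1'/'col2' and the sample data from the question:

```python
import pandas as pd

# Sample data from the question, with assumed column names 'col1'/'col2'.
rows = [("g", "k"), ("a", "h"), ("c", "i"), ("j", "e"), ("d", "i"),
        ("i", "h"), ("b", "b"), ("d", "d"), ("i", "a"), ("d", "h")]
df = pd.DataFrame(rows, columns=["col1", "col2"])

# Keep 'col2' only where it differs from 'col1' (so a row like "b b"
# counts once), stack it under 'col1', and count each symbol.
counts = pd.concat(
    [df["col1"], df.loc[df["col1"] != df["col2"], "col2"]]
).value_counts()
print(counts.sort_index())
```

The filtering step is what makes this a row count rather than a raw occurrence count: a symbol appearing in both columns of the same row contributes exactly one to its total.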

Timings

Using the following setup to produce a larger sample dataset:

from string import ascii_lowercase

import numpy as np
import pandas as pd

n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n, 2))
df = pd.DataFrame(data, columns=['col1', 'col2'])

def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0), df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count

I get the following timings:

%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop

%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop

OK, this is much trickier than I thought. I'm not sure how this will scale, but if you have a lot of repeating values it will be more efficient than your current method. Basically, we can use str.get_dummies and reindex the columns from that result to generate a dummies DataFrame for all unique values; we can then take the element-wise np.maximum of the two DataFrames and sum:

In [77]:
import io
t="""col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0), df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()

Out[77]:
a    2
b    1
c    1
d    3
e    1
g    1
h    3
i    4
j    1
k    1
dtype: float64

vals here is just the unique values:

In [80]:
vals = np.unique(df.values)
vals

Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
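For reference, reindex_axis has since been removed from pandas; a minimal self-contained sketch of the same dummies approach on current pandas, using reindex(columns=..., fill_value=0) instead (column names 'col1'/'col2' and the question's sample data assumed):

```python
import numpy as np
import pandas as pd

# Sample data from the question, with assumed column names 'col1'/'col2'.
rows = [("g", "k"), ("a", "h"), ("c", "i"), ("j", "e"), ("d", "i"),
        ("i", "h"), ("b", "b"), ("d", "d"), ("i", "a"), ("d", "h")]
df = pd.DataFrame(rows, columns=["col1", "col2"])

vals = np.unique(df.values)  # all distinct symbols across both columns

# One indicator matrix per column, aligned on the full symbol set;
# reindex(columns=...) replaces the removed reindex_axis(..., axis=1).
d1 = df["col1"].str.get_dummies().reindex(columns=vals, fill_value=0)
d2 = df["col2"].str.get_dummies().reindex(columns=vals, fill_value=0)

# Element-wise max merges the two indicators (a row counts once even if
# the symbol appears in both of its columns); summing counts the rows.
count = np.maximum(d1, d2).sum()
print(count)
```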
