How to count the number of occurrences in either of two columns
I have a simple-looking problem. I have a dataframe df with two columns. For each string that occurs in either of these columns, I would like to count the number of rows that have that symbol in either column.

E.g.
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h
The following code works but is very inefficient.
for elem in set(df.values.flat):
    print(elem, len(df.loc[(df[0] == elem) | (df[1] == elem)]))
a 2
c 1
b 1
e 1
d 3
g 1
i 4
h 3
k 1
j 1
This is however very inefficient, and my dataframe is large. The inefficiency comes from calling

df.loc[(df[0] == elem) | (df[1] == elem)]

separately for every distinct symbol in df.

Is there a fast way of doing this?
You can use loc to filter out the row-level matches from 'col2', append the filtered 'col2' values to 'col1', and then call value_counts:
counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
The resulting output:
i 4
d 3
h 3
a 2
j 1
k 1
c 1
g 1
b 1
e 1
Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
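Note also that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0. On a recent pandas, the same count can be computed with pd.concat; a minimal sketch (not part of the original answer):

import pandas as pd

# pandas >= 2.0 equivalent of the append-based one-liner above:
counts = pd.concat(
    [df['col1'], df.loc[df['col1'] != df['col2'], 'col2']]
).value_counts()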
Timings
Using the following setup to produce a larger sample dataset:
import numpy as np
import pandas as pd
from string import ascii_lowercase

n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n, 2))
df = pd.DataFrame(data, columns=['col1', 'col2'])
def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
                       df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count
I get the following timings:
%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop
%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop
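Note that reindex_axis, used in edchum above, was deprecated and later removed from pandas. On current versions the same dummies-based method can be written with reindex(columns=...); a sketch assuming the df defined above (the function name edchum_modern is mine):

import numpy as np

def edchum_modern(df):
    # Same idea as edchum, with reindex(columns=...) replacing the
    # removed reindex_axis; fill_value=0 replaces the fillna(0) step.
    vals = np.unique(df.values)
    d1 = df['col1'].str.get_dummies().reindex(columns=vals, fill_value=0)
    d2 = df['col2'].str.get_dummies().reindex(columns=vals, fill_value=0)
    return np.maximum(d1, d2).sum()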
OK, this is much trickier than I thought. I'm not sure how this will scale, but if you have a lot of repeating values it will be more efficient than your current method. Basically, we can use str.get_dummies and reindex the columns of each result to generate a dummies df covering all unique values; we can then take np.maximum of the 2 dfs and sum:
In [77]:
import io
import numpy as np
import pandas as pd

t = """col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
# vals (the unique values) is defined below
np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
           df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
Out[77]:
a 2
b 1
c 1
d 3
e 1
g 1
h 3
i 4
j 1
k 1
dtype: float64
vals here is just the unique values:
In [80]:
vals = np.unique(df.values)
vals
Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
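To see why np.maximum is used rather than simply adding the two dummies frames: a row such as b b would otherwise be counted twice, once per column. A minimal illustration (the variable names are mine, not from the answer):

import numpy as np
import pandas as pd

s = pd.DataFrame({'col1': ['b'], 'col2': ['b']})
d1 = s['col1'].str.get_dummies()
d2 = s['col2'].str.get_dummies()

print((d1 + d2).sum())           # b    2  -- double-counts the row
print(np.maximum(d1, d2).sum())  # b    1  -- the row counts once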