How to count the number of occurrences in either of two columns
I have a simple-looking problem. I have a dataframe df with two columns. For each string that occurs in either of these columns, I would like to count the number of rows that have that symbol in either column.

E.g.
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h
The following code works but is very inefficient.
for elem in set(df.values.flat):
    print(elem, len(df.loc[(df[0] == elem) | (df[1] == elem)]))
a 2
c 1
b 1
e 1
d 3
g 1
i 4
h 3
k 1
j 1
This is however very inefficient, and my dataframe is large. The inefficiency comes from calling

df.loc[(df[0] == elem) | (df[1] == elem)]

separately for every distinct symbol in df.

Is there a fast way of doing this?
You can use loc to filter out the row-level matches from 'col2', append the filtered 'col2' values to 'col1', and then call value_counts:
counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
The resulting output:
i 4
d 3
h 3
a 2
j 1
k 1
c 1
g 1
b 1
e 1
Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
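Note also that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0. On a recent pandas, the same count can be computed with pd.concat; a minimal sketch (not part of the original answer):

import pandas as pd

# pandas >= 2.0 equivalent of the append-based one-liner above:
counts = pd.concat(
    [df['col1'], df.loc[df['col1'] != df['col2'], 'col2']]
).value_counts()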
Timings
Using the following setup to produce a larger sample dataset:
import numpy as np
import pandas as pd
from string import ascii_lowercase

n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n, 2))
df = pd.DataFrame(data, columns=['col1', 'col2'])
def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
                       df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count
I get the following timings:
%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop
%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop
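Note that reindex_axis, used in edchum above, was deprecated and later removed from pandas. On current versions the same dummies-based method can be written with reindex(columns=...); a sketch assuming the df defined above (the function name edchum_modern is mine):

import numpy as np

def edchum_modern(df):
    # Same idea as edchum, with reindex(columns=...) replacing the
    # removed reindex_axis; fill_value=0 replaces the fillna(0) step.
    vals = np.unique(df.values)
    d1 = df['col1'].str.get_dummies().reindex(columns=vals, fill_value=0)
    d2 = df['col2'].str.get_dummies().reindex(columns=vals, fill_value=0)
    return np.maximum(d1, d2).sum()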
OK, this is much trickier than I thought. I'm not sure how this will scale, but if you have a lot of repeating values it will be more efficient than your current method. Basically, we can use str.get_dummies and reindex the columns of each result to generate a dummies df covering all unique values; we can then take np.maximum of the 2 dfs and sum:
In [77]:
import io
import numpy as np
import pandas as pd

t = """col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
# vals (the unique values) is defined below
np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
           df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
Out[77]:
a 2
b 1
c 1
d 3
e 1
g 1
h 3
i 4
j 1
k 1
dtype: float64
vals here is just the unique values:
In [80]:
vals = np.unique(df.values)
vals
Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
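To see why np.maximum is used rather than simply adding the two dummies frames: a row such as b b would otherwise be counted twice, once per column. A minimal illustration (the variable names are mine, not from the answer):

import numpy as np
import pandas as pd

s = pd.DataFrame({'col1': ['b'], 'col2': ['b']})
d1 = s['col1'].str.get_dummies()
d2 = s['col2'].str.get_dummies()

print((d1 + d2).sum())           # b    2  -- double-counts the row
print(np.maximum(d1, d2).sum())  # b    1  -- the row counts once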