Count the number of distinct rows each word appears in

Question

I have a Pandas DataFrame (or a Series, given that I'm just using one column) that contains strings. I also have a list of words. For each word in this list, I want to check how many different rows it appears in at least once. For example:

import pandas as pd

words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])

In this case, the output should be:

0   hi      3
1   bye     3
2   foo     3
3   bar     2

Because "hi" appears in three different rows (1st, 3rd and 4th), "bar" appears in two (3rd and 4th), and so on.

I came up with the following way to do this:

word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances.update({word: appearances})

pd.DataFrame(word_appearances.items())

This works fine, but the problem is that I have a rather long list of words (around 40,000), around 30,000 rows to check, and strings that are not as short as the ones in the example. When I try this approach on my real data, it takes forever to run. Is there a more efficient way to do this?

Answer 1

Try a list comprehension with str.contains and sum:

df_out = pd.DataFrame([[word, sum(df.str.contains(word))] for word in words], 
                       columns=['word', 'word_count'])

Out[58]:
  word  word_count
0   hi           3
1  bye           3
2  foo           3
3  bar           2
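
One detail worth keeping in mind with this approach: str.contains interprets each word as a regular expression by default and matches substrings, so "hi" would also be counted in a row containing "this". The following is a minimal sketch, not part of the original answer, that reuses the example data from the question and passes regex=False so each word is treated as a literal string. It still matches substrings and still makes one pass over the Series per word.

```python
import pandas as pd

# Example data copied from the question.
words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])

# regex=False treats each word as a literal substring, not a pattern.
# Summing the boolean Series counts the rows that contain the word.
df_out = pd.DataFrame(
    [[word, int(df.str.contains(word, regex=False).sum())] for word in words],
    columns=['word', 'word_count'],
)
print(df_out)
```

If the words in the real data contain characters such as "." or "+", the literal match avoids accidental regex behaviour.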

Answer 2

word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances[word] = appearances

pd.DataFrame.from_dict(word_appearances, orient='index', columns=['Frequency'])
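
Both answers scan every row once per word, which is the expensive part when there are around 40,000 words and 30,000 rows. The following is a rough sketch of an alternative that goes over the rows only once. It is not from either answer, and it rests on an assumption that changes the semantics: words are taken to be whitespace-separated tokens, so "hi" is not counted inside "this", and punctuation attached to a token is not stripped. The name `row_counts` is introduced here for illustration.

```python
from collections import Counter

import pandas as pd

# Example data copied from the question.
words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])

# Assumption: a "word" is a whitespace-delimited token.
# Converting each row to a set means a token counts at most once per row.
row_counts = Counter()
for text in df:
    row_counts.update(set(text.split()))

# Look up only the words of interest; Counter returns 0 for absent words.
result = pd.DataFrame({'word': words,
                       'word_count': [row_counts[w] for w in words]})
print(result)
```

With this layout the cost grows with the total number of tokens plus the number of words, rather than with the number of words times the number of rows. Whether the whole-token semantics are acceptable depends on the real data.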

2 answers

Answer 1
2, accepted, 2020-01-16 18:36:27

Answer 2
0, 2020-01-16 18:36:15
