Counting the number of times a word appears in n tweets

Question

My dataframe has roughly 118,000 tweets. Here is a made-up sample:

Tweets
1 The apple is red
2 The grape is purple
3 The tree is green

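I also used the set function to get a list of every unique word found in my tweets dataframe. For the example above, it looks like this (in no particular order):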

Words
1 The
2 is
3 apple
4 grape 
....so on

Basically, I need to find out how many tweets contain a given word. For example, "The" is found in 3 tweets, "apple" in 1 tweet, "is" in 3 tweets, and so on.

I have tried using a nested for loop like this:

number_words = [0]*len(words)
for i in range(len(words)):
    for j in range(len(tweets)):
        if words[i] in tweets[j]:
            number_words[i] += 1
number_words

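This creates a new list and, for each word in the list, counts the number of tweets that contain it. However, this block is extremely inefficient and takes forever to run.

What is a better way to do this?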

Answer 1

You can use str.count:

df.Tweets.str.count(word).sum()

For example, assuming Words is a list:

for word in Words:
    print(f'{word} count: {df.Tweets.str.count(word).sum()}')

Full example:

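import pandas as pd
from io import StringIO  # pd.compat.StringIO is not available in current pandas

data = """
Tweets
The apple is red
The grape is purple
The tree is green
"""
datb = """
Words
The
is
apple
grape
"""

# Each sample has a single column, so any separator absent from the text works.
dfa = pd.read_csv(StringIO(data), sep=';')
dfb = pd.read_csv(StringIO(datb), sep=';')

Words = dfb['Words'].values
dico = {}
for word in Words:
    # str.count treats word as a regex and counts every match in each tweet
    dico[word] = int(dfa.Tweets.str.count(word).sum())

print(dico)

Output:

{'The': 3, 'is': 3, 'apple': 1, 'grape': 1}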
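One caveat worth illustrating: str.count adds up every regex match, including matches inside longer words and repeats within the same tweet, whereas the question asks for the number of tweets containing the word. The sketch below contrasts the two, assuming whole-word matching is what is wanted. The helper name tweets_containing and the fourth sample tweet are hypothetical, added only to make the difference visible.

```python
import re

import pandas as pd

# The three sample tweets plus one hypothetical tweet that repeats "is"
# and also contains it inside "This".
df = pd.DataFrame({
    "Tweets": [
        "The apple is red",
        "The grape is purple",
        "The tree is green",
        "This is it, is it not",
    ]
})


def tweets_containing(series, word):
    """Return how many rows contain `word` as a whole word (hypothetical helper)."""
    # re.escape guards against regex metacharacters in the word;
    # \b restricts the match to word boundaries.
    pattern = rf"\b{re.escape(word)}\b"
    return int(series.str.contains(pattern, regex=True).sum())


# Total regex matches across all tweets (substring matches included).
occurrences = int(df["Tweets"].str.count("is").sum())
# Number of tweets in which "is" appears as a word.
tweet_hits = tweets_containing(df["Tweets"], "is")

print(occurrences, tweet_hits)
```

Here the two numbers differ because the last tweet contributes three matches to str.count but counts only once as a tweet.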

Answer 2

You can use a defaultdict for this to store all the word counts, like so:

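from collections import defaultdict

# tweets is assumed to be an iterable of strings, e.g. df['Tweets']
word_counts = defaultdict(int)
for tweet in tweets:
    # Split into words; set() counts each word at most once per tweet.
    for word in set(tweet.split()):
        word_counts[word] += 1
# print(word_counts['some_word']) will output the number of tweets containing some_word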
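When the tweets already live in a dataframe column, the same single-pass idea can probably be expressed in pandas itself. The following is a rough sketch, assuming whitespace tokenization is good enough and a pandas version that provides Series.explode (0.25 or later); the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Tweets": [
        "The apple is red",
        "The grape is purple",
        "The tree is green",
    ]
})

# One row per (tweet, word); the original row index identifies the tweet.
tokens = df["Tweets"].str.split().explode()

# Move the tweet index into a column and drop repeated words within a tweet,
# so each word contributes at most once per tweet.
per_tweet = tokens.reset_index().drop_duplicates()

# Count how many distinct tweets each word appears in.
tweet_counts = per_tweet["Tweets"].value_counts()

print(tweet_counts)
```

Looking up a word is then a plain index access such as tweet_counts["apple"], which avoids scanning all tweets once per word.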

Answer 3

This will turn your list of words into a dictionary:

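import collections

# Assumes tweets is a list of strings; join them before splitting into words.
# Note that this counts total occurrences, not the number of tweets.
words = " ".join(tweets).split()
counter = collections.Counter(words)

for key, value in sorted(counter.items()):
    print("`{}` is repeated {} time(s)".format(key, value))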
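A Counter over all words gives total occurrences. To get the number of tweets per word instead, one option is to update the counter with the set of words from each tweet. The sketch below does that and also lowercases the text and strips punctuation, which is an assumption about the desired matching rather than something the question specifies; tweet_frequency is a hypothetical helper name.

```python
import re
from collections import Counter

tweets = [
    "The apple is red",
    "The grape is purple",
    "The tree is green",
]


def tweet_frequency(tweets):
    """Map each word to the number of tweets it appears in (hypothetical helper)."""
    counter = Counter()
    for tweet in tweets:
        # Lowercase and keep runs of word characters, then deduplicate
        # so a word repeated inside one tweet is counted once.
        tokens = set(re.findall(r"\w+", tweet.lower()))
        counter.update(tokens)
    return counter


doc_freq = tweet_frequency(tweets)

for word, n in sorted(doc_freq.items()):
    print("`{}` appears in {} tweet(s)".format(word, n))
```

This makes one pass over the tweets, so the cost grows with the total number of words rather than with the number of unique words times the number of tweets.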

3 answers

Answer 1
5 accepted 2019-04-03 18:05:07

Answer 2
1 2019-04-03 18:08:43

Answer 3
0 2019-04-03 18:10:28