
Count the number of rows that each word appears in

I have a training dataset that is a NumPy array with shape (4800, 1). It has a single column of strings, and each row corresponds to the text of a different email.

Using Python, I want to create a dictionary that counts the number of emails (that is, the number of rows) each word appears in, and eventually keep only the words that appear in at least 10 emails. So far I could only figure out how to count how often each word appears in the entire dataset, not how many rows/emails it appears in. My code so far is included below.

Here is an example of what the array looks like and what the output should be.

 [['red blue green green']
 ['red blue blue'] 
 ['red red red']]

Output:

{'red': 3, 'blue': 2, 'green': 1}

def vocab_dict(xTrain):
    d = dict()
    for row in xTrain:
        words = row.split(" ")
        for word in words:
            # counts every occurrence, so repeats within one email are counted too
            if word in d:
                d[word] = d[word] + 1
            else:
                d[word] = 1
    # keep only words with a count of at least 10
    d = dict((k, v) for k, v in d.items() if v >= 10)
    return d

I am stuck on how to modify the above code, which counts how many times a word appears in the whole dataset, so that it instead counts how many rows (emails) each word appears in.

Let's say we have a list of strings l. Then we can do:

from collections import Counter

word_lists = [text.split(" ") for text in l] # split into words
word_sets = [set(word_list) for word_list in word_lists] # make sets, discard duplicates

c = Counter()
for word_set in word_sets:
    c.update(word_set)
print(c)

c will now contain, for each word, the number of emails that word appears in.
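The same count can also be written more compactly by passing a generator straight to Counter. This is only a minimal sketch: it assumes l is a flat list of strings as above, and the sample data is simply the example from the question.

```python
from collections import Counter

# Sample data taken from the question's example; replace with your own list.
l = ['red blue green green', 'red blue blue', 'red red red']

# set(...) removes repeats within one email, so each email
# contributes at most 1 to the count of any given word.
c = Counter(word for text in l for word in set(text.split(" ")))

print(c["red"], c["blue"], c["green"])  # 3 2 1
```

A Counter returns 0 for words it has never seen, so lookups such as c["yellow"] do not raise a KeyError.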

You want to iterate over each line and, for each unique word in that line, add one to the dict entry for that word. You can get the unique words by converting the list to a set.

def vocab_dict(data):
    lines_count = {}
    for line in data:
        for word in set(line.split()):
            old_count = lines_count.get(word, 0)
            lines_count[word] = old_count + 1
    return lines_count

The dict.get() method returns the value for that key, or the given default of 0 if the key doesn't exist. Alternatively, you could use collections.defaultdict.

Testing:

l = ['red blue green green', 'red blue blue', 'red red red']
vocab_dict(l)
# Out:  {'green': 1, 'blue': 2, 'red': 3}
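The collections.defaultdict alternative mentioned above could look roughly like this. It is a sketch rather than a definitive version, and the function name vocab_dict_defaultdict is made up here purely for illustration.

```python
from collections import defaultdict


def vocab_dict_defaultdict(data):
    """Count, for each word, how many lines it appears in."""
    lines_count = defaultdict(int)  # missing keys start at 0
    for line in data:
        for word in set(line.split()):  # unique words in this line
            lines_count[word] += 1
    return dict(lines_count)  # hand back a plain dict


l = ['red blue green green', 'red blue blue', 'red red red']
result = vocab_dict_defaultdict(l)
print(sorted(result.items()))  # [('blue', 2), ('green', 1), ('red', 3)]
```

With defaultdict(int) there is no need for the explicit get(word, 0) lookup, because a missing key is created with the value 0 on first access.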

One option is to convert the words list into a set, which gets rid of the repetition within each row. You can do it like this:

[...]
for word in set(words):
    if word in d:
    [...]
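Put together with the rest of the question's loop, that change might look like the sketch below. A few things here are assumptions rather than part of the original posts: the parameter names rows and min_emails are invented for illustration, each row is assumed to hold exactly one string (as in an array of shape (n, 1)), and nested lists stand in for the NumPy array in the sample.

```python
def vocab_dict(rows, min_emails=10):
    """Map each word to the number of rows (emails) that contain it,
    keeping only words found in at least `min_emails` rows."""
    d = dict()
    for row in rows:
        # Each row of an (n, 1) array holds one string, so take element 0.
        words = row[0].split(" ")
        for word in set(words):  # set() drops repeats within one email
            if word in d:
                d[word] = d[word] + 1
            else:
                d[word] = 1
    return dict((k, v) for k, v in d.items() if v >= min_emails)


# Nested lists stand in for the (n, 1) NumPy array here.
sample = [['red blue green green'], ['red blue blue'], ['red red red']]
print(sorted(vocab_dict(sample, min_emails=2).items()))  # [('blue', 2), ('red', 3)]
```

Note that split(" ") yields empty strings when a text contains consecutive spaces; calling split() with no argument avoids that if it matters for your data.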
