Count the number of rows that each word appears in

I have a training dataset stored as a NumPy array of shape (4800, 1). It has a single column of strings, and each row corresponds to the text of a different email.

I want to create a dictionary in Python that counts the number of emails (that is, the number of rows) each word appears in, and eventually keep only the words that appear in at least 10 emails. So far I have only figured out how to count how often each word appears in the entire dataset, not how many rows/emails it appears in. My code so far is shown after the example.

Here is an example of what the array looks like, and what the output should be.

 [['red blue green green']
 ['red blue blue'] 
 ['red red red']]

output:

{'red': 3, 'blue': 2, 'green': 1}

def vocab_dict(file):
    d = dict() 
    for row in xTrain:
        words = row.split(" ") 
        for word in words: 
            if word in d: 
                d[word] = d[word] + 1
            else: 
                d[word] = 1
    d = dict((k, v) for k, v in d.items() if v >= 10)
    return d

I am stuck on how to modify the above code so that, instead of counting how many times a word appears in the whole dataset, it counts how many rows (emails) each word appears in.

Suppose we have a list of strings l. Then we can do:

from collections import Counter

word_lists = [text.split(" ") for text in l] # split into words
word_sets = [set(word_list) for word_list in word_lists] # make sets, discard duplicates

c = Counter()
for word_set in word_sets:
    c.update(word_set)
print(c)

c will now contain, for each word, the number of emails that contain that word.
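The question also asks to keep only the words that appear in at least 10 emails, and its data is an array of one-string rows rather than a flat list. Below is a minimal sketch of how the Counter approach might be extended to cover both. The function name email_counts and the min_emails parameter are hypothetical, and each row is assumed to hold exactly one string at index 0 (as in a NumPy array of shape (n, 1) or a list of one-item lists):

```python
from collections import Counter

def email_counts(rows, min_emails=1):
    """Count how many rows (emails) each word appears in.

    Assumes every row holds exactly one string at index 0.
    """
    counts = Counter()
    for row in rows:
        text = row[0]                      # the single string in this row
        counts.update(set(text.split()))   # each word counted once per email
    # Keep only words that occur in at least `min_emails` emails.
    return {word: n for word, n in counts.items() if n >= min_emails}

data = [['red blue green green'], ['red blue blue'], ['red red red']]
all_words = email_counts(data)
frequent = email_counts(data, min_emails=2)
print(all_words)
print(frequent)
```

For the dataset in the question, calling it with min_emails=10 would apply the threshold described there.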

You want to iterate over each line and, for each unique word in that line, add one to the dict entry for that word. You can get the unique words by converting the list to a set.

def vocab_dict(data):
    lines_count = {}
    for line in data:
        for word in set(line.split()):
            old_count = lines_count.get(word, 0)
            lines_count[word] = old_count + 1
    return lines_count

Here dict.get(word, 0) returns the value stored under that key, or the supplied default of 0 if the key doesn't exist (without the second argument the default would be None). Alternatively, you could use collections.defaultdict.
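For comparison, here is a sketch of the collections.defaultdict variant mentioned above. The name vocab_dict_dd is made up for illustration, and the input is assumed to be a flat list of strings:

```python
from collections import defaultdict

def vocab_dict_dd(data):
    lines_count = defaultdict(int)      # missing keys start at int() == 0
    for line in data:
        for word in set(line.split()):  # unique words in this line only
            lines_count[word] += 1
    return dict(lines_count)            # convert back to a plain dict

result = vocab_dict_dd(['red blue green green', 'red blue blue', 'red red red'])
print(result)
```

With defaultdict(int) the explicit lookup-with-default step disappears, at the cost of one extra import.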

Testing:

l = ['red blue green green', 'red blue blue', 'red red red']
vocab_dict(l)
# Out:  {'green': 1, 'blue': 2, 'red': 3}

One option is to convert the words list into a set, which removes the repeated words. You can do it like this:

[...]
for word in set(words):
    if word in d:
    [...]
