How do I find the most common words in multiple separate texts?

Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the different categories posts, and I want to get the most frequent words from the data section. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
   tokens = wordpunct_tokenize(text2)
   for token in tokens:
       if token in freq_dict:
           freq_dict[token] += 1
       else:
           freq_dict[token] = 1
   top = sorted(freq_dict, key=freq_dict.get, reverse=True)
   top = top[:50]
   print top

However, this will give me the top words PER post in the string.

I need a general top words list.
However, if I take print top out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?

This is a scoping problem: top is computed and printed inside the loop, once per post. Also, you don't need to initialize the elements of a defaultdict, so this simplifies your code:
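
A defaultdict(int) returns 0 for any missing key, so the in check and the else branch are unnecessary:

from collections import defaultdict

d = defaultdict(int)
d['x'] += 1    # no KeyError: the missing key starts at int() == 0
print(d['x'])  # 1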

Try it like this:

posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
   tokens = wordpunct_tokenize(text2)
   for token in tokens:
      freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

This, as expected, outputs

['data1', 'data3', 'data5', 'data2']

as a result.

If you really have something like

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

as an input, you won't need wordpunct_tokenize() as the input data is already tokenized. Then, the following would work:

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

from collections import defaultdict
freq_dict = defaultdict(int)

for cat, tokens in posts:
   for token in tokens:
      freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

and it also outputs the expected result:

['data1', 'data3', 'data5', 'data2']

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]

For several texts, you can chain the tokenized words from all of them into a single Counter:

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize
texts=["a quick brown car", "a fast yellow rose", "a quick night rider", "a yellow officer"]
print Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3)

outputs:

[('a', 4), ('yellow', 2), ('quick', 2)]

As you can see in the documentation for Counter.most_common, the returned list is sorted.
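
A quick check of that behaviour:

from collections import Counter

print(Counter("aaabbc").most_common())
# [('a', 3), ('b', 2), ('c', 1)] -- descending by count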

To use with your code, you can do

texts = (x[1] for x in posts)

or you can do

... wordpunct_tokenize(x[1]) for x in posts ...
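
Put together, the first variant looks like this (a minimal sketch, assuming posts holds [category, "raw text"] pairs as in the question):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

posts = [["category1", "data1 data2 data3"], ["category2", "data1 data3 data5"]]

texts = (x[1] for x in posts)  # keep only the text of each post
print(Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(50))
# e.g. [('data1', 2), ('data3', 2), ('data2', 1), ('data5', 1)] (tie order may vary)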

If your posts actually look like this:

posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])]

You can get rid of the categories:

texts = list(chain.from_iterable(x[1] for x in posts))

(texts will then be ['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer'])

You can then use that texts list in the snippet at the top of this answer.
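
End to end, that looks like this (a minimal sketch using the example posts just above):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

posts = [("category1", ["a quick brown car", "a fast yellow rose"]),
         ("category2", ["a quick night rider", "a yellow officer"])]

# flatten the per-category lists into one flat list of texts
texts = list(chain.from_iterable(x[1] for x in posts))

# tokenize each text and count every token in a single pass
print(Counter(chain.from_iterable(wordpunct_tokenize(t) for t in texts)).most_common(3))
# e.g. [('a', 4), ('yellow', 2), ('quick', 2)] -- ties may come out in either order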

Just change your code so that all the posts are processed first, and only then get the top words:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
   tokens = wordpunct_tokenize(text2)
   for token in tokens:
       freq_dict[token] += 1
# get top after all posts have been processed.
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top
