How do I find the most common words in multiple separate texts?

Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the different categories posts and I want to get the most frequent words from the data section. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print(top)

However, this will give me the top words PER post in the string.

I need a general top words list.
However, if I take print(top) out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?

This is a scope problem. Also, you don't need to initialize the elements of a defaultdict, which simplifies your code.
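For instance, a defaultdict(int) yields 0 for any missing key, so the membership check and the else branch are unnecessary (a minimal standalone demo):

from collections import defaultdict

d = defaultdict(int)
d['token'] += 1  # missing key defaults to 0, then is incremented to 1
d['token'] += 1
print(d['token'])  # 2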

Try it like this:

posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)

This, as expected, outputs

['data1', 'data3', 'data5', 'data2']

as a result.

If you really have something like

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

as an input, you won't need wordpunct_tokenize(), since the input data is already tokenized. Then, the following would work:

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

from collections import defaultdict
freq_dict = defaultdict(int)

for cat, tokens in posts:
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)

and it also outputs the expected result:

['data1', 'data3', 'data5', 'data2']

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
To count words across several separate texts at once, you can chain the tokenized texts together:

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

texts = ["a quick brown car", "a fast yellow rose", "a quick night rider", "a yellow officer"]
print(Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3))

outputs:

[('a', 4), ('yellow', 2), ('quick', 2)]

As you can see in the documentation for Counter.most_common, the returned list is sorted from the most common element to the least.

To use this with your code, you can do

texts = (x[1] for x in posts)

or you can do

... wordpunct_tokenize(x[1]) for x in posts ...
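Spelled out, that variant might look like this (a minimal sketch, assuming posts holds (category, text) pairs as in the question):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

# hypothetical sample in the question's shape: one text string per category
posts = [("category1", "data1 data2 data3"), ("category2", "data1 data3 data5")]

# tokenize the text of every post and feed all tokens into a single Counter
counts = Counter(chain.from_iterable(wordpunct_tokenize(x[1]) for x in posts))
print(counts.most_common(50))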

If your posts actually look like this:

posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])]

You can get rid of the categories:

texts = list(chain.from_iterable(x[1] for x in posts))

(texts will be ['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer'])

You can then use that in the snippet at the top of this answer.
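Put together, that could look like the following (a sketch reusing the sample posts above):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

posts = [("category1", ["a quick brown car", "a fast yellow rose"]),
         ("category2", ["a quick night rider", "a yellow officer"])]

# drop the categories, leaving a flat list of texts
texts = list(chain.from_iterable(x[1] for x in posts))

# tokenize each text and count every token across all of them
print(Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3))

Since these are the same four texts as before, 'a' again comes out on top.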

Just change your code so that all the posts are processed first, and only then get the top words:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

# get top after all posts have been processed
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)
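For example, with hypothetical posts in the question's (category, text) shape:

posts = [("category1", "data1 data2 data3"), ("category2", "data1 data3 data5")]

the loop above would rank data1 and data3 (each seen twice) ahead of data2 and data5 in top.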
