How do I find the most common words in multiple separate texts?

Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the different categories posts and I want to get the most frequent words from the data section. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print(top)

However, this will give me the top words PER post in the string.

I need a general top words list.
However, if I take print(top) out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?

This is a scope problem. Also, you don't need to initialize the elements of a defaultdict, which simplifies your code.
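For instance, a defaultdict(int) yields 0 for any missing key, so the membership check and the else branch are unnecessary (a minimal standalone demo):

from collections import defaultdict

d = defaultdict(int)
d['token'] += 1  # missing key defaults to 0, then is incremented to 1
d['token'] += 1
print(d['token'])  # 2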

Try it like this:

posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)

This, as expected, outputs

['data1', 'data3', 'data5', 'data2']

as a result.

If you really have something like

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

as an input, you won't need wordpunct_tokenize(), since the input data is already tokenized. Then, the following would work:

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

from collections import defaultdict
freq_dict = defaultdict(int)

for cat, tokens in posts:
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)

and it also outputs the expected result:

['data1', 'data3', 'data5', 'data2']

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
To count words across several separate texts at once, you can chain the tokenized texts together:

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

texts = ["a quick brown car", "a fast yellow rose", "a quick night rider", "a yellow officer"]
print(Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3))

outputs:

[('a', 4), ('yellow', 2), ('quick', 2)]

As you can see in the documentation for Counter.most_common, the returned list is sorted from the most common element to the least.

To use this with your code, you can do

texts = (x[1] for x in posts)

or you can do

... wordpunct_tokenize(x[1]) for x in posts ...
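Spelled out, that variant might look like this (a minimal sketch, assuming posts holds (category, text) pairs as in the question):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

# hypothetical sample in the question's shape: one text string per category
posts = [("category1", "data1 data2 data3"), ("category2", "data1 data3 data5")]

# tokenize the text of every post and feed all tokens into a single Counter
counts = Counter(chain.from_iterable(wordpunct_tokenize(x[1]) for x in posts))
print(counts.most_common(50))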

If your posts actually look like this:

posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])]

You can get rid of the categories:

texts = list(chain.from_iterable(x[1] for x in posts))

(texts will be ['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer'])

You can then use that in the snippet at the top of this answer.
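Put together, that could look like the following (a sketch reusing the sample posts above):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

posts = [("category1", ["a quick brown car", "a fast yellow rose"]),
         ("category2", ["a quick night rider", "a yellow officer"])]

# drop the categories, leaving a flat list of texts
texts = list(chain.from_iterable(x[1] for x in posts))

# tokenize each text and count every token across all of them
print(Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3))

Since these are the same four texts as before, 'a' again comes out on top.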

Just change your code so that all the posts are processed first, and only then get the top words:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

# get top after all posts have been processed
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print(top)
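For example, with hypothetical posts in the question's (category, text) shape:

posts = [("category1", "data1 data2 data3"), ("category2", "data1 data3 data5")]

the loop above would rank data1 and data3 (each seen twice) ahead of data2 and data5 in top.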
