Creating a dictionary of the words in text

I want to create a dictionary of all unique words in a text. The key is the word and the value is the word's frequency.

dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
word_listT = str(' '.join(dtt)).split()
wordsT = {v:k for (k, v) in enumerate(word_listT)}
print(wordsT)

I expect something like this:

{'we': 2, 'is': 1, 'peace': 2, 'at': 2, 'want': 2, 'our': 3, 'home': 4, 'you': 1, 'went': 1, 'nice': 1}

However, I receive this:

{'we': 14, 'is': 12, 'peace': 16, 'at': 17, 'want': 15, 'our': 10, 'home': 18, 'you': 0, 'went': 7, 'nice': 13}

Apparently, I am misusing the functionality or doing something wrong.

Please help.

The problem is that you are storing the list index of each word instead of a count of how often the word occurs.

To count the words, you can just use collections.Counter:

from collections import Counter

dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
counted_words = Counter(' '.join(dtt).split())
# if you want to see what the counted words are, you can print them
# (the parenthesized form works on both Python 2 and Python 3)
print(counted_words)

>>> Counter({'home': 4, 'our': 3, 'we': 2, 'peace': 2, 'at': 2, 'want': 2, 'is': 1, 'you': 1, 'went': 1, 'nice': 1})

SOME CLEANUP: as mentioned in the comments

str() is unnecessary for your ' '.join(dtt).split(), because join already returns a string.

You can also remove the list assignment and build your counter on the same line:

Counter(' '.join(dtt).split())
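
If you need the result as an ordinary dictionary, as in the question, the following is a minimal sketch of how a Counter can be used. The variable names and the lookup word 'garden' are only illustrative and are not from the original post.

```python
from collections import Counter

dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
counted_words = Counter(' '.join(dtt).split())

# Counter is a dict subclass, so lookups work like a normal dictionary.
home_count = counted_words['home']

# A word that never occurs gives 0 instead of raising KeyError.
missing_count = counted_words['garden']

# dict() gives a plain dictionary if that is what you need.
plain = dict(counted_words)

# most_common(n) lists the n most frequent words with their counts.
top_two = counted_words.most_common(2)

print(home_count, missing_count, top_two)
```

Here home_count is 4, missing_count is 0, and top_two is [('home', 4), ('our', 3)].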

A little more detail about your list indices: first you have to understand what your code is doing.

dtt = [
    'you want home at our peace', 
    'we went our home', 
    'our home is nice', 
    'we want peace at home'
]

Notice that you have 19 words here; len(word_listT) returns 19. On the next line, word_listT = str(' '.join(dtt)).split() builds a list of all of the words, which looks like this:

word_listT = [
    'you', 
    'want', 
    'home', 
    'at', 
    'our', 
    'peace', 
    'we', 
    'went', 
    'our', 
    'home', 
    'our', 
    'home', 
    'is', 
    'nice', 
    'we', 
    'want', 
    'peace', 
    'at', 
    'home'
] 

Count them again: 19 words. The very last word is 'home'. List indices start at 0, so indices 0 to 18 cover 19 elements, and word_listT[18] is 'home'. This has nothing to do with the position inside the string; it is just the index in your new list. :)
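
To check this for yourself, here is a small sketch that compares the dictionary from the question with the last index of each word, computed separately. The helper dictionary last_index is an illustrative name, not part of the original code.

```python
dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
word_listT = ' '.join(dtt).split()

# The comprehension from the question: each word maps to an index, and a
# repeated word overwrites its earlier entry, so the last index wins.
wordsT = {v: k for (k, v) in enumerate(word_listT)}

# Compute the last index of every word independently for comparison:
# find the word in the reversed list and convert back to a forward index.
last_index = {}
for word in set(word_listT):
    last_index[word] = len(word_listT) - 1 - word_listT[::-1].index(word)

print(wordsT == last_index)
print(wordsT['home'], word_listT[18])
```

The two dictionaries are equal, and wordsT['home'] is 18, the index of the final 'home' in the list.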

Try this:

from collections import defaultdict

dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
word_list = str(' '.join(dtt)).split()
d = defaultdict(int)
for word in word_list:
    d[word] += 1
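
As a rough illustration of why this works: defaultdict(int) calls int() for any key it has not seen, and int() returns 0, so the += 1 never hits a missing key. The short sentence and the word 'garden' below are made up for the example.

```python
from collections import defaultdict

d = defaultdict(int)

# int() with no arguments returns 0, which becomes the starting count.
for word in 'our home is our home'.split():
    d[word] += 1

# Convert to a plain dict once counting is done.
result = dict(d)

# Caveat: merely reading a missing key inserts it with the default value.
_ = d['garden']
has_garden = 'garden' in d

print(result, has_garden)
```

Converting to a plain dict at the end avoids accidentally adding keys through later lookups.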

enumerate yields each word together with its index, not its frequency. That is, when you create the wordsT dictionary, each value is actually the index in word_listT of the last occurrence of its key, because every later occurrence overwrites the earlier entry. To do what you want, a for loop is probably the most straightforward approach:

wordsT = {}
for word in word_listT:
    try:
        wordsT[word]+=1
    except KeyError:
        wordsT[word] = 1
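
A common variation on the same loop, shown here only as a sketch and not taken from the answers above, uses dict.get with a default of 0 in place of try/except. The comparison with Counter at the end is just a sanity check.

```python
from collections import Counter

dtt = ['you want home at our peace', 'we went our home', 'our home is nice', 'we want peace at home']
word_listT = ' '.join(dtt).split()

wordsT = {}
for word in word_listT:
    # get() returns 0 when the word has not been seen yet.
    wordsT[word] = wordsT.get(word, 0) + 1

# Sanity check: the manual count agrees with collections.Counter.
matches_counter = wordsT == dict(Counter(word_listT))
print(wordsT['home'], matches_counter)
```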
