Iterating through multiple values in a dictionary?

I have a list of words:

word_list = ["it's","they're","there's","he's"]

And a dictionary containing information on how frequently the words in word_list appear in several documents:

dict = [('document1',{"it's": 0,"they're": 2,"there's": 5,"he's": 1}),
('document2',{"it's": 4,"they're": 2,"there's": 3,"he's": 0}),
('document3',{"it's": 7,"they're": 0,"there's": 4,"he's": 1})]

I want to develop a data structure (a data frame, perhaps?) that looks like the following:

file       word       count
document1  it's        0
document1  they're     2
document1  there's     5
document1  he's        1
document2  it's        4
document2  they're     2
document2  there's     3
document2  he's        0
document3  it's        7
document3  they're     0
document3  there's     4
document3  he's        1

I'm trying to find the words used most often in these documents. I have more than 900 documents.

I'm thinking of something like the following:

res = {}
for i in words_list:
    count = 0
    for j in dict.items():
         if i == j:
              count = count + 1
              res[i,j] = count

Where can I go from here?

OK, first things first: your dict is not a dict (it is a list of tuples) and should be built as one, like so:

d = {'document1':{"it's": 0,"they're": 2,"there's": 5,"he's": 1},
    'document2':{"it's": 4,"they're": 2,"there's": 3,"he's": 0},
    'document3':{"it's": 7,"they're": 0,"there's": 4,"he's": 1}}

Now that we actually have a dictionary, we can use pandas to build a dataframe. To get it in the shape you want, we first have to build a list of lists out of the dictionary. Then we create the dataframe, label the columns, and sort:

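import pandas as pd

d = {'document1':{"it's": 0,"they're": 2,"there's": 5,"he's": 1},
    'document2':{"it's": 4,"they're": 2,"there's": 3,"he's": 0},
    'document3':{"it's": 7,"they're": 0,"there's": 4,"he's": 1}}

# Flatten the nested dict into [file, word, count] rows.
d = pd.DataFrame([[k,k1,v1] for k,v in d.items() for k1,v1 in v.items()], columns = ['File','Words','Count'])
# sort_values replaces DataFrame.sort, which was removed from pandas.
# The index labels in the printed output follow the dict's iteration order,
# so they can differ between Python versions.
print(d.sort_values(['File','Count'], ascending=[True,True]))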

         File    Words  Count
1   document1     it's      0
0   document1     he's      1
3   document1  they're      2
2   document1  there's      5
4   document2     he's      0
7   document2  they're      2
6   document2  there's      3
5   document2     it's      4
11  document3  they're      0
8   document3     he's      1
10  document3  there's      4
9   document3     it's      7

If you want the top n occurrences, you can use groupby() followed by head() or tail() on the sorted frame. With the ascending sort used here, head(n) returns the n lowest counts per file and tail(n) the n highest:

d = d.sort_values(['File','Count'], ascending=[True,True]).groupby('File').head(2)

         File    Words  Count
1   document1     it's      0
0   document1     he's      1
4   document2     he's      0
7   document2  they're      2
11  document3  they're      0
8   document3     he's      1
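
Since the question asks for the words used most often, here is a minimal sketch of the same approach sorted in descending order, plus a total per word across all files. The names counts, rows, df, top2 and totals are only illustrative, and the code assumes a pandas version that provides sort_values:

```python
import pandas as pd

counts = {
    'document1': {"it's": 0, "they're": 2, "there's": 5, "he's": 1},
    'document2': {"it's": 4, "they're": 2, "there's": 3, "he's": 0},
    'document3': {"it's": 7, "they're": 0, "there's": 4, "he's": 1},
}

# Flatten the nested dict into [file, word, count] rows.
rows = [[name, word, n]
        for name, words in counts.items()
        for word, n in words.items()]
df = pd.DataFrame(rows, columns=['File', 'Words', 'Count'])

# Sort Count in descending order so head(2) keeps the two most
# frequent words in each file.
top2 = (df.sort_values(['File', 'Count'], ascending=[True, False])
          .groupby('File')
          .head(2))
print(top2)

# Sum each word's count over all files, most frequent first.
totals = df.groupby('Words')['Count'].sum().sort_values(ascending=False)
print(totals)
```

Sorting in descending order and calling head(n) should give the same rows as sorting in ascending order and calling tail(n), only listed from most to least frequent.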

The list comprehension returns a list of lists that looks like this:

d = [['document1', "he's", 1], ['document1', "it's", 0],
     ['document1', "there's", 5], ['document1', "they're", 2],
     ['document2', "he's", 0], ['document2', "it's", 4],
     ['document2', "there's", 3], ['document2', "they're", 2],
     ['document3', "he's", 1], ['document3', "it's", 7],
     ['document3', "there's", 4], ['document3', "they're", 0]]

To build the dictionary properly, you would use something along the lines of:

d['document1']['it\'s'] = 1
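
A plain assignment like the one above raises a KeyError if d['document1'] does not exist yet. One possible way around that, sketched here under the assumption that the counts are collected one occurrence at a time, is a nested collections.defaultdict. The occurrences list is made-up sample input, not data from the question:

```python
from collections import defaultdict

# Hypothetical input: one (file, word) pair per occurrence found
# while scanning the documents.
occurrences = [
    ('document1', "they're"),
    ('document1', "they're"),
    ('document1', "he's"),
    ('document2', "it's"),
]

# Missing files and words are created on first access, starting at 0.
d = defaultdict(lambda: defaultdict(int))
for file_name, word in occurrences:
    d[file_name][word] += 1

# Convert to plain dicts once counting is finished.
plain = {file_name: dict(words) for file_name, words in d.items()}
print(plain)
```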

If for some reason you are dead set on using the list of tuples of strs and dicts, you can use this list comprehension instead:

[[i[0],k1,v1] for i in d for k1,v1 in i[1].items()]

How about something like this?

word_list = ["it's","they're","there's","he's"]

frequencies = [('document1',{"it's": 0,"they're": 2,"there's": 5,"he's": 1}),
('document2',{"it's": 4,"they're": 2,"there's": 3,"he's": 0}),
('document3',{"it's": 7,"they're": 0,"there's": 4,"he's": 1})]

result = []
for document in frequencies:
    for word in word_list:
        result.append({"file":document[0], "word":word,"count":document[1][word]})

print(result)
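
If only the overall most frequent words are needed, the same list of tuples can also be summed without building the per-file rows first. This is a rough sketch using collections.Counter from the standard library. The name totals is illustrative:

```python
from collections import Counter

frequencies = [
    ('document1', {"it's": 0, "they're": 2, "there's": 5, "he's": 1}),
    ('document2', {"it's": 4, "they're": 2, "there's": 3, "he's": 0}),
    ('document3', {"it's": 7, "they're": 0, "there's": 4, "he's": 1}),
]

totals = Counter()
for _name, counts in frequencies:
    # Counter.update adds the counts from a mapping to the running totals.
    totals.update(counts)

# The two most frequent words across all documents.
print(totals.most_common(2))
```

For 900 documents this keeps only one running total per word in memory, rather than one row per file and word.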
