Python: How to update value of key value pair in nested dictionary?

I am trying to build an inverted document index, so for every unique word in a collection I need to know which documents it occurs in and how often.

I used this answer to create a nested dictionary. The provided solution works fine, with one problem though.

First I open the file and make a list of unique words. I then want to compare these unique words with the original file. When there is a match, the frequency counter should be updated and its value stored in the two-dimensional array.

The output should eventually look like this:

    word1, {doc1 : freq}, {doc2 : freq}
    word2, {doc1 : freq}, {doc2 : freq}, {doc3 : freq}
    etc.

The problem is that I cannot update the dictionary variable. When I try to do so, I get this error:

      File "scriptV3.py", line 45, in main
        freq = dictionary[keyword][filename] + 1
    TypeError: unsupported operand type(s) for +: 'AutoVivification' and 'int'

I think I need to cast the AutoVivification instance to int in some way.

How should I proceed?

Thanks in advance.

My code:

    #!/usr/bin/env python
    # encoding: utf-8

    import sys
    import os
    import re
    import glob
    import string
    import sets

    class AutoVivification(dict):
        """Implementation of perl's autovivification feature."""
        def __getitem__(self, item):
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                value = self[item] = type(self)()
                return value

    def main():
        pad = 'temp/'
        dictionary = AutoVivification()
        docID = 0
        for files in glob.glob( os.path.join(pad, '*.html') ):  #for all files in specified folder:
            docID = docID + 1
            filename = "doc_"+str(docID)
            text = open(files, 'r').read()                       #returns content of file as string
            text = extract(text, '<pre>', '</pre>')              #call extract function to extract text from within <pre> tags
            text = text.lower()                                  #all words to lowercase
            exclude = set(string.punctuation)                    #sets list of all punctuation characters
            text = ''.join(char for char in text if char not in exclude) # use created exclude list to remove characters from files
            text = text.split()                                  #creates list (array) from string
            uniques = set(text)                                  #make list unique (is that useful? we still need to count)

            for keyword in uniques:                              #For every unique word do
                for word in text:                                #for every word in doc:
                    if (word == keyword and dictionary[keyword][filename] is not None): #if there is an occurence of keyword increment counter
                        freq = dictionary[keyword][filename]     #here we fail, cannot cast object instance to integer.
                        freq = dictionary[keyword][filename] + 1
                        print(keyword,dictionary[keyword])
                    else:
                        dictionary[word][filename] = 1

    #extract text between substring 1 and 2
    def extract(text, sub1, sub2):
        return text.split(sub1, 1)[-1].split(sub2, 1)[0]

    if __name__ == '__main__':
        main()

One could use Python's collections.defaultdict instead of creating an AutoVivification class and then instantiating dictionary as an object of that type.

    import collections
    dictionary = collections.defaultdict(lambda: collections.defaultdict(int))

This will create a dictionary of dictionaries with a default value of 0. When you wish to increment an entry, use:

    dictionary[keyword][filename] += 1
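
To see the nested defaultdict in action without touching the file system, here is a minimal sketch. The `docs` mapping and its contents are made up for illustration and stand in for the text extracted from the HTML files; `index` plays the role of `dictionary`.

```python
import collections

# Hypothetical in-memory documents standing in for the parsed HTML files.
docs = {
    "doc_1": "the cat sat on the mat",
    "doc_2": "the dog sat",
}

# Outer dict maps word -> inner dict; inner dict maps document name -> count.
# A missing inner entry is created on first access with the value int() == 0.
index = collections.defaultdict(lambda: collections.defaultdict(int))

for filename, text in docs.items():
    for word in text.split():
        index[word][filename] += 1

# Print one line per word, roughly in the shape the question asks for.
for word in sorted(index):
    print(word, dict(index[word]))
```

Note that merely reading `index[some_word]` also creates an empty entry for it, so use `some_word in index` when you only want to test for presence.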

I agree you should avoid the extra classes, and especially __getitem__. (Small conceptual errors can make __getitem__ or __getattr__ quite painful to debug.)

Python's dict seems quite strong enough for what you are doing.

What about straightforward dict.setdefault?

    for keyword in uniques:  #For every unique word do
        for word in text:  #for every word in doc:
            if (word == keyword):
                dictionary.setdefault(keyword, {})
                dictionary[keyword].setdefault(filename, 0)
                dictionary[keyword][filename] += 1

Of course this would be where dictionary is just a dict, and not something from collections or a custom class of your own.

Then again, isn't this just:

    for word in text:  #for every word in doc:
        dictionary.setdefault(word, {})
        dictionary[word].setdefault(filename, 0)
        dictionary[word][filename] += 1

No reason to isolate unique instances, since the dict forces unique keys anyway.
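
As a self-contained sketch of the single-loop version, the helper below wraps it in a function. The name `add_document` and the sample word lists are my own, not from the question; it relies on `dict.setdefault` returning the stored value.

```python
def add_document(dictionary, filename, words):
    """Count every word of one document into a plain nested dict."""
    for word in words:
        # setdefault returns the existing inner dict, or stores and returns a new one.
        per_doc = dictionary.setdefault(word, {})
        per_doc.setdefault(filename, 0)
        per_doc[filename] += 1
    return dictionary


dictionary = {}
add_document(dictionary, "doc_1", "a b a".split())
add_document(dictionary, "doc_2", "b c".split())

for word in sorted(dictionary):
    print(word, dictionary[word])
```

Because `dictionary` stays a plain dict, looking up a word that was never added still raises KeyError instead of silently creating an entry.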

    if (word == keyword and dictionary[keyword][filename] is not None):

That is not correct usage, I guess. Instead, try this:

    if (word == keyword and filename in dictionary[keyword]):

Because looking up the value of a non-existent key raises KeyError, you must check whether the key exists in the dictionary.
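
A small sketch of that membership check, assuming `dictionary` is a plain nested dict. The `frequency` helper and the sample data are hypothetical, just to show the pattern.

```python
dictionary = {"cat": {"doc_1": 3}}  # made-up sample data


def frequency(dictionary, keyword, filename):
    """Return the stored count, or 0 when either key is absent."""
    if keyword in dictionary and filename in dictionary[keyword]:
        return dictionary[keyword][filename]
    return 0


# Direct indexing of a missing key raises KeyError on a plain dict.
try:
    dictionary["cat"]["doc_2"]
    raised = False
except KeyError:
    raised = True

print(raised)
print(frequency(dictionary, "cat", "doc_1"))
print(frequency(dictionary, "dog", "doc_1"))
```

The `in` test never modifies the dict, so checking for a word does not add it to the index.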

I think you are trying to add 1 to a dictionary entry that doesn't yet exist. Your __getitem__ method is, for some reason, returning a new instance of the AutoVivification class when a lookup fails. You're therefore trying to add 1 to a new instance of the class.

I think the answer is to update the __getitem__ method so that it sets the counter to 0 if it doesn't yet exist.

    class AutoVivification(dict):
        """Implementation of perl's autovivification feature."""
        def __getitem__(self, item):
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                self[item] = 0
                return 0

Hope this helps.

Not sure why you need nested dicts here. In a typical index scenario you have a forward index mapping

    document id -> [word_ids]

and an inverse index mapping

    word_id -> [document_ids]

Not sure if this is related here, but using two indexes you can perform all kinds of queries very efficiently, and the implementation is straightforward since you don't need to deal with nested data structures.
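
A rough sketch of the two-index layout, under some assumptions of mine: the words themselves serve as word ids, the `docs` data is invented, and `docs_with_all` is a hypothetical query helper. Note that this layout records only which documents contain a word, not how often.

```python
from collections import defaultdict

# Hypothetical tokenized documents, keyed by document id.
docs = {1: ["cat", "sat", "cat"], 2: ["dog", "sat"]}

# Forward index: document id -> sorted list of distinct words.
forward = {doc_id: sorted(set(words)) for doc_id, words in docs.items()}

# Inverse index: word -> list of document ids, built from the forward index.
inverse = defaultdict(list)
for doc_id in sorted(forward):
    for word in forward[doc_id]:
        inverse[word].append(doc_id)


def docs_with_all(words):
    """Return the ids of documents that contain every given word."""
    id_sets = [set(inverse.get(word, [])) for word in words]
    return sorted(set.intersection(*id_sets)) if id_sets else []


print(inverse["sat"])
print(docs_with_all(["cat", "sat"]))
```

If frequencies are needed as well, they could be kept in a separate flat dict keyed by (word, document id) pairs, which still avoids nesting.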

In the AutoVivification class, you define

    value = self[item] = type(self)()
    return value

which returns a new instance of self's class, which is an AutoVivification in that context. The error then becomes clear.

Are you sure you want to return an AutoVivification on any missing key query? From the code, I would assume you want to return a normal dictionary with string keys and int values.

By the way, maybe you would be interested in the defaultdict class.
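
The behaviour is easy to reproduce in isolation. This sketch copies the class from the question and looks up two keys that do not exist yet; the key names are arbitrary.

```python
class AutoVivification(dict):
    """Implementation of perl's autovivification feature (copied from the question)."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value


dictionary = AutoVivification()
entry = dictionary["word"]["doc_1"]  # neither key exists yet

print(type(entry).__name__)  # the lookup hands back a fresh AutoVivification
print(entry is not None)     # so an "is not None" test can never fail

try:
    entry + 1
    failed = False
except TypeError as err:
    failed = True
    print(err)

# The lookup itself inserted both keys as a side effect.
print(dictionary)
```

So the missing counter is not 0 but an empty AutoVivification, which is exactly the left-hand operand named in the TypeError.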

It would be better to kick AutoVivification out altogether, because it adds nothing.

The following line:

    if (word == keyword and dictionary[keyword][filename] is not None):

doesn't work as expected: because of the way your class works, dictionary[keyword] will always return an instance of AutoVivification, and so will dictionary[keyword][filename].

This AutoVivification class is not the magic you are looking for.

Check out collections.defaultdict from the standard library. Your inner dicts should be defaultdicts that default to integer values, and your outer dict would then be a defaultdict that defaults to inner-dict values.

    #!/usr/bin/env python
    # encoding: utf-8

    from os.path import join
    from glob import glob as glob_
    from collections import defaultdict, Counter
    from string import punctuation

    WORKDIR  = 'temp/'
    FILETYPE = '*.html'
    OUTF     = 'doc_{0}'.format

    def extract(text, startTag='<pre>', endTag='</pre>'):
        """Extract text between start tag and end tag

        Start at first char following first occurrence of startTag
          If none, begin at start of text
        End at last char preceding first subsequent occurrence of endTag
          If none, end at end of text
        """
        return text.split(startTag, 1)[-1].split(endTag, 1)[0]

    def main():
        DocWords = defaultdict(dict)

        infnames = glob_(join(WORKDIR, FILETYPE))
        for docId,infname in enumerate(infnames, 1):
            outfname = OUTF(docId)
            with open(infname) as inf:
                text = inf.read().lower()
            # note: strip() only trims punctuation at the two ends of the extracted text
            words = extract(text).strip(punctuation).split()
            # items() works on both Python 2 and 3 (iteritems() is Python 2 only)
            for wd,num in Counter(words).items():
                DocWords[wd][outfname] = num

    if __name__ == '__main__':
        main()