Python: How to update value of key value pair in nested dictionary?

I am trying to build an inverted document index, so for every unique word in a collection I need to know which documents it occurs in and how often.

I used this answer to create a nested dictionary. The provided solution works fine, with one problem though.

First I open the file and make a list of unique words. I then want to compare these unique words with the original file. When there is a match, the frequency counter should be updated and its value stored in the two-dimensional array.

The output should eventually look like this:

word1, {doc1 : freq}, {doc2 : freq}
word2, {doc1 : freq}, {doc2 : freq}, {doc3 : freq}
etc.

The problem is that I cannot update the dictionary variable. When I try to do so, I get this error:

  File "scriptV3.py", line 45, in main
    freq = dictionary[keyword][filename] + 1
TypeError: unsupported operand type(s) for +: 'AutoVivification' and 'int'

I think I need to somehow cast the AutoVivification instance to an int.

How should I go about this?

Thanks in advance.

My code:

#!/usr/bin/env python 
# encoding: utf-8

import sys
import os
import re
import glob
import string
import sets

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

def main():
    pad = 'temp/'
    dictionary  = AutoVivification()
    docID = 0
    for files in glob.glob( os.path.join(pad, '*.html') ):  #for all files in specified folder:
        docID = docID + 1
        filename = "doc_"+str(docID)
        text = open(files, 'r').read()                      #returns content of file as string
        text = extract(text, '<pre>', '</pre>')             #call extract function to extract text from within <pre> tags
        text = text.lower()                                 #all words to lowercase
        exclude = set(string.punctuation)                   #sets list of all punctuation characters
        text = ''.join(char for char in text if char not in exclude) # use created exclude list to remove characters from files
        text = text.split()                                 #creates list (array) from string
        uniques = set(text)                                 #make list unique (is that useful? we still need to count)

        for keyword in uniques:                             #For every unique word do   
            for word in text:                               #for every word in doc:
                if (word == keyword and dictionary[keyword][filename] is not None): #if there is an occurence of keyword increment counter 
                    freq = dictionary[keyword][filename]    #here we fail, cannot cast object instance to integer.
                    freq = dictionary[keyword][filename] + 1
                    print(keyword,dictionary[keyword])
                else:
                    dictionary[word][filename] = 1

#extract text between substring 1 and 2 
def extract(text, sub1, sub2): 
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]    

if __name__ == '__main__':
    main()

One could use Python's collections.defaultdict instead of creating an AutoVivification class and then instantiating dictionary as an object of that type.

import collections
dictionary = collections.defaultdict(lambda: collections.defaultdict(int))

This creates a dictionary of dictionaries whose innermost values default to 0. To increment an entry, use:

dictionary[keyword][filename] += 1
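
As a minimal sketch of how this fits the question's loop, assuming the documents are already tokenized into lists of words (the `docs` mapping and its contents are made up for illustration and stand in for the parsed HTML files):

```python
import collections

# Hypothetical pre-tokenized documents standing in for the parsed HTML files.
docs = {
    "doc_1": ["the", "cat", "saw", "the", "dog"],
    "doc_2": ["the", "dog", "barked"],
}

# Outer dict maps word -> inner dict; inner dict maps filename -> count (default 0).
dictionary = collections.defaultdict(lambda: collections.defaultdict(int))

for filename, words in docs.items():
    for word in words:
        # Missing words and filenames are created on first access, starting at 0.
        dictionary[word][filename] += 1

for word in sorted(dictionary):
    print(word, dict(dictionary[word]))
```

No existence check is needed before the increment, because both levels supply their own default.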

I agree you should avoid the extra classes, and especially __getitem__. (Small conceptual errors can make __getitem__ or __getattr__ quite painful to debug.)

Python's dict seems quite strong enough for what you are doing.

What about a straightforward dict.setdefault?

    for keyword in uniques:                             #For every unique word do   
        for word in text:                               #for every word in doc:
            if (word == keyword):
                dictionary.setdefault(keyword, {})
                dictionary[keyword].setdefault(filename, 0)
                dictionary[keyword][filename] += 1

Of course, this assumes dictionary is just a plain dict, and not something from collections or a custom class of your own.

Then again, isn't this just:

        for word in text:                               #for every word in doc:
            dictionary.setdefault(word, {})
            dictionary[word].setdefault(filename, 0)
            dictionary[word][filename] += 1

There is no reason to isolate unique instances, since the dict forces unique keys anyway.
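
To check that claim, here is a small sketch that runs both loops on the same input and compares the results. The function names and the sample sentence are invented for this illustration:

```python
def count_with_uniques(text, filename, dictionary):
    """Nested loop: compare every unique word against every word in the document."""
    for keyword in set(text):
        for word in text:
            if word == keyword:
                dictionary.setdefault(keyword, {})
                dictionary[keyword].setdefault(filename, 0)
                dictionary[keyword][filename] += 1
    return dictionary


def count_single_pass(text, filename, dictionary):
    """Single loop: the dict keys already keep the words unique."""
    for word in text:
        dictionary.setdefault(word, {})
        dictionary[word].setdefault(filename, 0)
        dictionary[word][filename] += 1
    return dictionary


text = "the cat saw the dog".split()
nested = count_with_uniques(text, "doc_1", {})
single = count_single_pass(text, "doc_1", {})
print(nested == single)
```

Both versions build the same counts, but the single pass touches each word once instead of once per unique word.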

if (word == keyword and dictionary[keyword][filename] is not None): 

That is not correct usage, I guess. Instead, try this:

if (word == keyword and filename in dictionary[keyword]): 

Checking the value of a non-existent key raises a KeyError in a plain dict, so you must first check whether the key exists in the dictionary.
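
A minimal sketch of that membership check, assuming dictionary is a plain dict rather than the AutoVivification class; the `increment` helper and the sample data are hypothetical:

```python
dictionary = {"cat": {"doc_1": 2}}

# Reading a missing key from a plain dict raises KeyError.
try:
    dictionary["dog"]["doc_1"]
    lookup_failed = False
except KeyError:
    lookup_failed = True


def increment(dictionary, keyword, filename):
    """Add 1 to the count, creating the entry when it does not exist yet."""
    if keyword in dictionary and filename in dictionary[keyword]:
        dictionary[keyword][filename] += 1
    else:
        dictionary.setdefault(keyword, {})[filename] = 1


increment(dictionary, "cat", "doc_1")   # existing entry: 2 becomes 3
increment(dictionary, "dog", "doc_2")   # new entry starts at 1
print(dictionary)
```

The `in` test only asks whether the key is present, so it never triggers a lookup of a missing value.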

I think you are trying to add 1 to a dictionary entry that doesn't exist yet. Your __getitem__ method is for some reason returning a new instance of the AutoVivification class when a lookup fails. You're therefore trying to add 1 to a new instance of the class.

I think the answer is to update the __getitem__ method so that it sets the counter to 0 if it doesn't exist yet.

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            self[item] = 0
            return 0

Hope this helps.

Not sure why you need nested dicts here. In a typical index scenario you have a forward index mapping

document id -> [word_ids]

and an inverse index mapping

word_id -> [document_ids]

Not sure if this is relevant here, but with two indexes you can perform all kinds of queries very efficiently, and the implementation is straightforward since you don't need to deal with nested data structures.
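
A rough sketch of the two mappings, assuming the words themselves serve as word ids and that documents are already tokenized. The `docs` data and the `docs_with_all` helper are made up for this example:

```python
from collections import defaultdict

# Hypothetical tokenized documents, keyed by document id.
docs = {
    1: ["cat", "dog"],
    2: ["dog", "bird"],
    3: ["cat", "bird", "dog"],
}

# Forward index: document id -> list of distinct words in that document.
forward_index = {doc_id: sorted(set(words)) for doc_id, words in docs.items()}

# Inverse index: word -> set of document ids that contain it.
inverse_index = defaultdict(set)
for doc_id, words in forward_index.items():
    for word in words:
        inverse_index[word].add(doc_id)


def docs_with_all(*words):
    """Return the ids of documents that contain every given word."""
    postings = [inverse_index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


print(docs_with_all("cat", "dog"))
print(forward_index[2])
```

Both structures are flat, so a multi-word query reduces to a set intersection. Note that this sketch records only which documents contain a word, not how often.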

In the AutoVivification class, you define

value = self[item] = type(self)()
return value

which returns a new instance of the same type as self, which is an AutoVivification in that context. The error then becomes clear.
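
The behaviour can be reproduced in a few lines. This sketch reuses the class from the question; the keys "word" and "doc_1" are placeholders:

```python
class AutoVivification(dict):
    """Implementation of perl's autovivification feature (as in the question)."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value


dictionary = AutoVivification()

# Looking up missing keys creates them and returns an empty AutoVivification.
entry = dictionary["word"]["doc_1"]
print(type(entry).__name__)
print(entry is not None)   # always true, so this test cannot detect a missing entry

# Adding an int to that object is what triggers the TypeError.
try:
    entry + 1
    addition_failed = False
except TypeError as err:
    addition_failed = True
    print(err)
```

The lookup never returns a number or None for a missing key, so neither the `is not None` test nor the `+ 1` can behave as the question intends.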

Are you sure you want to return an AutoVivification on any missing key query? From the code, I would assume you want to return a normal dictionary with string keys and int values.

By the way, maybe you would be interested in the defaultdict class.

It would be better to kick AutoVivification out altogether, because it adds nothing.

The following line:

if (word == keyword and dictionary[keyword][filename] is not None):

This doesn't work as expected: because of the way your class works, dictionary[keyword] will always return an instance of AutoVivification, and so will dictionary[keyword][filename].

This AutoVivification class is not the magic you are looking for.

Check out collections.defaultdict from the standard library. Your inner dicts should be defaultdicts that default to integer values, and your outer dict would then be a defaultdict that defaults to inner-dict values.

#!/usr/bin/env python
# encoding: utf-8
from os.path import join
from glob import glob as glob_
from collections import defaultdict, Counter
from string import punctuation

WORKDIR  = 'temp/'
FILETYPE = '*.html'
OUTF     = 'doc_{0}'.format

def extract(text, startTag='<pre>', endTag='</pre>'):
    """Extract text between start tag and end tag

    Start at first char following first occurrence of startTag
      If none, begin at start of text
    End at last char preceding first subsequent occurrence of endTag
      If none, end at end of text
    """
    return text.split(startTag, 1)[-1].split(endTag, 1)[0]    

def main():
    DocWords = defaultdict(dict)

    infnames = glob_(join(WORKDIR, FILETYPE))
    for docId,infname in enumerate(infnames, 1):
        outfname = OUTF(docId)
        with open(infname) as inf:
            text = inf.read().lower()
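        # Remove punctuation everywhere (str.strip would only trim the ends of the text).
        cleaned = ''.join(ch for ch in extract(text) if ch not in punctuation)
        for wd, num in Counter(cleaned.split()).items():
            DocWords[wd][outfname] = num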

if __name__ == '__main__':
    main()
