Sub-dictionary erroneously repeated throughout dictionary?

I'm trying to store in a dictionary the number of times a given letter occurs after another given letter. For example, dictionary['a']['d'] would give me the number of times 'd' follows 'a' in short_list.

alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
short_list = ['ford','hello','orange','apple']

# dictionary to keep track of how often a given letter occurs
tally = {}
for a in alphabet:
    tally[a] = 0

# dictionary to keep track of how often a given letter occurs after a given letter 
# e.g. how many times does 'd' follow 'a' -- master_dict['a']['d']
master_dict = {}
for a in alphabet:
    master_dict[a] = tally

def precedingLetter(letter,word):
    if word.index(letter) == 0:
         return
    else:
         return word[word.index(letter)-1]

for a in alphabet:
    for word in short_list:
        for b in alphabet:
            if precedingLetter(b,word) == a:
                 master_dict[a][b] += 1

However, the entries for all of the letters (the keys) in master_dict are identical. I can't think of another way to properly tally each letter's occurrence after another letter. Can anyone offer some insight here?

If the sub-dicts are all supposed to be updated independently after creation, you need to shallow copy them. The easiest/fastest way is with .copy():

for a in alphabet:
    master_dict[a] = tally.copy()
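
To see why the copy matters, here is a minimal sketch comparing the two ways of filling the outer dict. The names shared, independent and fresh_tally are illustrative only (they are not from the question), and a three-letter alphabet is assumed to keep it short:

```python
alphabet = ['a', 'b', 'c']
tally = {letter: 0 for letter in alphabet}

# Every key refers to the very same dict object.
shared = {letter: tally for letter in alphabet}
shared['a']['b'] += 1

# Every key gets its own shallow copy of a zeroed tally.
fresh_tally = {letter: 0 for letter in alphabet}
independent = {letter: fresh_tally.copy() for letter in alphabet}
independent['a']['b'] += 1

print(shared['a'] is shared['c'])            # True: one object, many names
print(shared['c']['b'])                      # 1: the update shows up under every key
print(independent['a'] is independent['c'])  # False: separate objects
print(independent['c']['b'])                 # 0: other keys are untouched
```

A shallow copy is enough here because the values are plain integers; nested mutable values would need copy.deepcopy() instead.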

The other approach is to initialize the dict lazily. The easiest way to do that is with defaultdict:

from collections import defaultdict

masterdict = defaultdict(lambda: defaultdict(int))

# or

from collections import Counter, defaultdict

masterdict = defaultdict(Counter)

There is no need to pre-create empty tallies or populate masterdict at all, and this avoids creating dicts for letters that never occur. If you access masterdict[a] for an a that doesn't yet exist, it creates a defaultdict(int) value for it automatically. When masterdict[a][b] is accessed and doesn't exist, the count is initialized to 0 automatically.
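
As a rough illustration of that lazy behavior (the keys used here are arbitrary examples), the sketch below also shows one caveat worth knowing: merely reading a missing key with [] inserts it, whereas .get() and in do not.

```python
from collections import defaultdict

masterdict = defaultdict(lambda: defaultdict(int))

print(len(masterdict))          # 0: nothing is created up front
masterdict['o']['r'] += 1       # both levels are created on first use
print(dict(masterdict['o']))    # {'r': 1}

# Caveat: reading a missing key through [] also inserts it.
print(masterdict['x']['y'])     # 0
print('x' in masterdict)        # True

# .get() and `in` do not trigger the default factory.
print(masterdict.get('q'))      # None
print('q' in masterdict)        # False
```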

In addition to the first answer, it could be handy to perform your search the other way around. Instead of looking for each possible pair of letters, you could iterate over just the words.

In combination with the defaultdict this could simplify the process. As an example:

from collections import defaultdict

short_list = ['ford','hello','orange','apple']
master_dict = defaultdict(lambda: defaultdict(int))

for word in short_list:
    for i in range(0,len(word)-1):
        master_dict[word[i]][word[i+1]] += 1

Now master_dict contains all letter combinations that occurred, and it returns zero for all other ones. A few examples:

print(master_dict["f"]["o"]) # ==> 1
print(master_dict["o"]["r"]) # ==> 2
print(master_dict["a"]["a"]) # ==> 0

The problem you ask about is that master_dict[a] = tally only gives the same object another name, so updating it through any of the references updates them all. You could fix that by making a copy of it each time with master_dict[a] = tally.copy(), as already pointed out in @ShadowRanger's answer.

As @ShadowRanger goes on to point out, it would also be considerably less wasteful to make your master_dict a defaultdict(lambda: defaultdict(int)), because doing so would only allocate and initialize counts for the combinations that are actually encountered rather than for all possible two-letter permutations (if it was used properly).

To give you a concrete idea of the savings, consider that there are only 15 unique letter pairs in your sample short_list of words, yet the exhaustive approach would still create and initialize 26 placeholders in each of 26 dictionaries, one for every one of the 676 possible counts.
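
Those two figures can be checked with a short sketch like the following. The names seen_pairs and possible_pairs are introduced here for illustration, and zip(word, word[1:]) is just one way to pair each letter with its successor:

```python
from string import ascii_lowercase

short_list = ['ford', 'hello', 'orange', 'apple']

# Collect every distinct (letter, next_letter) pair that actually occurs.
seen_pairs = {pair for word in short_list for pair in zip(word, word[1:])}

# Every ordered pair of lowercase letters, including doubled letters.
possible_pairs = len(ascii_lowercase) ** 2

print(len(seen_pairs))   # 15
print(possible_pairs)    # 676
```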

It also occurs to me that you really don't need a two-level dictionary at all to accomplish what you want, since the same thing could be done with a single dictionary whose keys are tuples of character pairs.

Beyond that, another important improvement, as pointed out in @AdmPicard's answer, is that your approach of iterating through all possible permutations and checking each word for them via the precedingLetter() function is significantly more time consuming than simply iterating over the successive pairs of letters that actually occur in each word.

So, putting all this advice together would result in something like the following:

from collections import defaultdict
from string import ascii_lowercase

alphabet = set(ascii_lowercase)
short_list = ['ford','hello','orange','apple']
# dictionary to keep track of how often a letter pair occurred after one other. 
# e.g. how many times 'd' followed an 'a' -> master_dict[('a','d')]
master_dict = defaultdict(int)

try:
    from itertools import izip
except ImportError:  # Python 3
    izip = zip

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = iter(iterable), iter(iterable)  # 2 independent iterators
    next(b, None)                          # advance the 2nd one
    return izip(a, b)

for word in short_list:
    for (ch1,ch2) in pairwise(word.lower()):
        if ch1 in alphabet and ch2 in alphabet:
            master_dict[(ch1,ch2)] += 1

# display results
unique_pairs = 0
for (ch1,ch2) in sorted(master_dict):
    print('({},{}): {}'.format(ch1, ch2, master_dict[(ch1,ch2)]))
    unique_pairs += 1

print('A total of {} different letter pairs occurred in'.format(unique_pairs))
print('the words: {}'.format(', '.join(repr(word) for word in short_list)))

Which produces this output from the short_list:

(a,n): 1
(a,p): 1
(e,l): 1
(f,o): 1
(g,e): 1
(h,e): 1
(l,e): 1
(l,l): 1
(l,o): 1
(n,g): 1
(o,r): 2
(p,l): 1
(p,p): 1
(r,a): 1
(r,d): 1

A total of 15 different letter pairs occurred in
the words: 'ford', 'hello', 'orange', 'apple'
