Where does this difference in size come from?

Question

I created a trie of sorts to store all the words (not the definitions) in the English dictionary. The point was to be able to get all the words that contain only letters within a given range.

The text file containing all the words is about 2.7 MB, but after building the tree and writing it to a file using pickle, the file is over 33 MB.

Where does this difference in size come from? I thought I would be saving space by not needing to store multiple copies of the same letter for different words, e.g. for the words app and apple I would only need 5 nodes, for a -> p -> p -> l -> e.

My code is as follows:

import pickle

class WordTrieNode:
    def __init__(self, nodeLetter='', parentNode=None, isWordEnding=False):
        self.nodeLetter = nodeLetter
        self.parentNode = parentNode
        self.isWordEnding = isWordEnding
        self.children = [None]*26 # One entry for each lowercase letter of the alphabet

    def getWord(self):
        if(self.parentNode is None):
            return ''

        return self.parentNode.getWord() + self.nodeLetter

    def isEndOfWord(self):
        return self.isWordEnding

    def markEndOfWord(self):
        self.isWordEnding = True

    def insertWord(self, word):
        if(len(word) == 0):
            return

        char = word[0]
        idx = ord(char) - ord('a')
        if(len(word) == 1):
            if(self.children[idx] is None):
                node = WordTrieNode(char, self, True)
                self.children[idx] = node
            else:
                self.children[idx].markEndOfWord()
        else:
            if(self.children[idx] is None):
                node = WordTrieNode(char, self, False)
                self.children[idx] = node
                self.children[idx].insertWord(word[1:])
            else:
                self.children[idx].insertWord(word[1:])

    def getAllWords(self):
        for node in self.children:
            if node is not None:
                if node.isEndOfWord():
                    print(node.getWord())
                node.getAllWords()

    def getAllWordsInRange(self, low='a', high='z'):
        i = ord(low) - ord('a')
        j = ord(high) - ord('a')
        for node in self.children[i:j+1]:
            if node is not None:
                if node.isEndOfWord():
                    print(node.getWord())
                node.getAllWordsInRange(low, high)



def main():

    tree = WordTrieNode("", None, False)

    with open('en.txt') as file:
        for line in file:
            tree.insertWord(line.strip('\n'))
    with open("treeout", 'wb') as output:
        pickle.dump(tree, output, pickle.HIGHEST_PROTOCOL)

    #tree.getAllWordsInRange('a', 'l')
    #tree.getAllWords()
if __name__ == "__main__":
    main()

Answer 1

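Nodes of a trie are huge because each one stores a link for every possible next letter. As you can see in the code, every node holds a list of 26 links (children), whether or not those children exist. So although shared prefixes reduce the number of letters stored, each node costs far more than the single byte a letter takes in the text file.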
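
To make the overhead concrete, here is a minimal sketch that builds a tiny trie with the same 26-slot layout and compares its pickled size with the raw text. `Node`, `insert` and `count_nodes` are simplified stand-ins written for this illustration, not the question's exact class, and the six-word list is made up.

```python
import pickle

class Node:
    """Simplified stand-in for the question's WordTrieNode."""
    def __init__(self, letter='', parent=None):
        self.letter = letter
        self.parent = parent
        self.is_end = False
        self.children = [None] * 26  # one slot per lowercase letter

def insert(root, word):
    """Walk down from the root, creating nodes as needed."""
    node = root
    for ch in word:
        idx = ord(ch) - ord('a')
        if node.children[idx] is None:
            node.children[idx] = Node(ch, node)
        node = node.children[idx]
    node.is_end = True

def count_nodes(node):
    """Count this node plus every node below it."""
    return 1 + sum(count_nodes(c) for c in node.children if c is not None)

words = ['app', 'apple', 'apply', 'apt', 'bat', 'bath']  # made-up sample
root = Node()
for w in words:
    insert(root, w)

# Size of the words as a newline-separated text file.
text_bytes = len('\n'.join(words).encode('ascii'))
# Size of the same words as a pickled trie.
pickled_bytes = len(pickle.dumps(root, pickle.HIGHEST_PROTOCOL))

node_count = count_nodes(root)
# Every node except the root fills exactly one slot in its parent.
empty_slots = node_count * 26 - (node_count - 1)

print('text bytes:   ', text_bytes)
print('pickled bytes:', pickled_bytes)
print('nodes:        ', node_count)
print('empty slots:  ', empty_slots)
```

On top of the mostly empty child lists, pickle also has to record the letter, the parent reference and the end-of-word flag for every node, which adds to the per-node cost.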

More compact schemes are possible ( https://en.wikipedia.org/wiki/Trie#Compressing_tries ), at the expense of more complexity and lower speed.
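
As one simple illustration (not the compression described at the link), a node can keep only the children that actually exist in a dict and drop the letter and parent fields, rebuilding each word during traversal instead. The names `CompactNode`, `insert` and `words_in_range` are hypothetical, and the word list is made up. This should pickle considerably smaller than the 26-slot layout, though the actual saving would need to be measured on the real word list.

```python
class CompactNode:
    """Sparse trie node: stores only existing children, no parent or letter."""
    __slots__ = ('children', 'is_end')

    def __init__(self):
        self.children = {}   # maps a letter to its child node
        self.is_end = False

def insert(root, word):
    """Add a word, creating child nodes only where they are missing."""
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = CompactNode()
        node = node.children[ch]
    node.is_end = True

def words_in_range(node, low='a', high='z', prefix=''):
    """Return words below this node whose letters all lie in [low, high]."""
    found = []
    if node.is_end:
        found.append(prefix)
    for ch in sorted(node.children):
        if low <= ch <= high:
            found.extend(
                words_in_range(node.children[ch], low, high, prefix + ch))
    return found

root = CompactNode()
for w in ['bad', 'badge', 'bath', 'cab', 'cabbage', 'dad', 'zebra']:
    insert(root, w)

print(words_in_range(root, 'a', 'g'))
```

The trade-off is the one mentioned above: dict lookups and sorting the keys at each level are slower than indexing into a fixed list, and the word has to be carried along as a prefix because nodes no longer know their parent.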

Solution 1
5 · Accepted · 2016-05-26 07:06:05