Problem using a dictionary of numpy arrays (indexing it wrong)

I'm trying to code Gaussian Naive Bayes from scratch using Python and numpy, but I'm having trouble creating the word frequency table.

I have a dictionary with N words as keys, and each of these N words has an associated numpy array.

Example:

freq_table['subject'] -> vector of occurrences of this word, of length nrows, where nrows is the number of rows in the dataset.

So for each row i in the dataset I'm doing: freq_table[WORD][i] += 1

def train(self, X):
        # Creating the dictionary
        self.dictionary(X.data[:100])

        # Calculating the class prior probabilities
        self.p_class = self.prior_probs(X.target)

        # Calculating the likelihoods
        nrows = len(X.data[:100])
        freq = dict.fromkeys(self._dict, nrows * [0])

        for doc, target, i in zip(X.data[:2], X.target[:2], range(2)):
            print('doc [%d] out of %d' % (i, nrows))

            words = preprocess(doc)

            print(len(words), i)

            for j, w in enumerate(words):
                print(w, j)

                # Get the vector assigned to the word w
                vec = freq[w]

                # Add one occurrence at the i-th position (observation id)
                vec[i] += 1

        print(freq['subject'])

The output is:

Dictionary length 4606

doc [0] out of 100
43 0
wheres 0
thing 1
subject 2
nntppostinghost 3
racwamumdedu 4
organization 5
university 6
maryland 7
college 8
lines 9
wondering 10
anyone 11
could 12
enlighten 13
sports 14
looked 15
early 16
called 17
bricklin 18
doors 19
really 20
small 21
addition 22
front 23
bumper 24
separate 25
anyone 26
tellme 27
model 28
engine 29
specs 30
years 31
production 32
history 33
whatever 34
funky 35
looking 36
please 37
email 38
thanks 39
brought 40
neighborhood 41
lerxst 42
[43, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

It seems that I'm indexing the dictionary and the vector incorrectly.

The word 'subject' should not have 43 or 53 occurrences; those numbers are the lengths of the preprocessed word lists for the two documents/rows (43 and 53 words).

The code has at least two errors:

1) In the line

freq = dict.fromkeys(self._dict, nrows * [0])

You initialize every item in the freq dictionary with the same list. nrows * [0] is evaluated once to create a single list, which is then passed to dict.fromkeys(). A reference to that one list is assigned to every key in freq, so no matter which key you select, you get the same list. This is a common gotcha in Python.

Instead, you can use a dictionary comprehension to create the entries with separate lists:

freq = {key: nrows * [0] for key in self._dict}
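
The difference between the two initializations can be seen in a small standalone sketch. The vocabulary and the value of nrows below are made up for illustration and do not come from the question's dataset:

```python
# Hypothetical three-word vocabulary and dataset size, for illustration only.
vocab = ["subject", "engine", "model"]
nrows = 4

# dict.fromkeys evaluates `nrows * [0]` once, so every key shares one list.
shared = dict.fromkeys(vocab, nrows * [0])
shared["subject"][0] += 1
shared_same_object = shared["subject"] is shared["engine"]

# A dict comprehension evaluates `nrows * [0]` once per key,
# so each key gets its own independent list.
separate = {key: nrows * [0] for key in vocab}
separate["subject"][0] += 1

print(shared_same_object)   # True
print(shared["engine"])     # [1, 0, 0, 0]  (changed through the 'subject' key)
print(separate["engine"])   # [0, 0, 0, 0]
```

With the shared list, every increment for any word lands in the same vector, which would explain a count equal to the total number of words in a document.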

2) You use i (the document index) as the index into vec, but you may have meant to use j (the position of the word within the document):

vec[j] += 1

Using descriptive variable names would help avoid this kind of confusion.
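
As a rough sketch of what descriptive names could look like, here is a standalone version of the counting loop. It assumes the layout described in the question (one vector per word, with one slot per document), so the slot is selected by the document index. The function name build_freq_table and the whitespace tokenizer are hypothetical stand-ins, since the question's preprocess() is not shown:

```python
def build_freq_table(docs, vocab):
    """Return {word: [count in doc 0, count in doc 1, ...]}."""
    n_docs = len(docs)
    # One independent list per word (see the dict comprehension in point 1).
    freq = {word: n_docs * [0] for word in vocab}

    for doc_idx, doc in enumerate(docs):
        # Stand-in for the question's preprocess(doc); assumed to return tokens.
        words = doc.lower().split()
        for word in words:
            if word in freq:
                # Assumption: the slot is chosen by the document index.
                freq[word][doc_idx] += 1
    return freq


docs = ["subject engine subject", "model engine"]
table = build_freq_table(docs, ["subject", "engine", "model"])
print(table["subject"])  # [2, 0]
print(table["engine"])   # [1, 1]
```

Naming the loop variables doc_idx and word makes it harder to mix up the document index with the position of a word inside a document.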
