Multinomial Naive Bayes

Question

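I am working on a multinomial Naive Bayes model. I am supposed to follow the pseudocode shown in the train method below. These are my questions:

1) I have written most of the code, but a few key parts still give me trouble: extracting the vocabulary, counting the documents in each class, and concatenating the text of all documents in a class.

2) I also noticed that the train method I am required to write takes only the documents (aka train_doc). So I don't know how to adapt it to get the classes C.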

import math
from collections import Counter, defaultdict

def train(self, documents):
    """
    TRAINMULTINOMIALNB(C, D)
    1  V <-- EXTRACTVOCABULARY(D)
    2  N <-- COUNTDOCS(D)
    3  for each c in C
    4  do Nc <-- COUNTDOCSINCLASS(D, c)
    5     prior[c] <-- Nc/N
    6     textc <-- CONCATENATETEXTOFALLDOCSINCLASS(D, c)
    7     for each t in V
    8     do Tct <-- COUNTTOKENSOFTERM(textc, t)
    9     for each t in V
    10    do condprob[t][c] <-- (Tct + 1) / sum over t' of (Tct' + 1)
    11 return V, prior, condprob
    """
    # Note: `***` marks index expressions that are elided in the post.
    # Note: C (the set of classes) is never defined here; see question 2.
    prior = {}
    N = len(documents)

    # Vocabulary: count every term seen in any document
    V = Counter()
    for d in documents:
        V.update(d[***])

    # COUNTDOCSINCLASS(D, c)
    cdic = Counter(C)
    for d2 in documents:
        for label in C:
            cdic.update({label: int(math.ceil(float(d2[***])))})

    # CONCATENATETEXTOFALLDOCSINCLASS(D, c): per-class term counts
    ctofadic = defaultdict(Counter)
    for d3 in documents:
        for label2 in C:
            if float(***) > 0:
                ctofadic[label2].update(d3[***])

    # Invert to term -> {class: count}
    tii = defaultdict(Counter)
    for label, word in ctofadic.items():
        for w in word:
            tii[w].update({label: word[w]})

    # Conditional probabilities with add-one smoothing
    gcp = defaultdict(lambda: defaultdict(float))
    tnw = {}  # total number of words in each label
    for l, v in ctofadic.items():
        tnw[l] = sum(v.values())
    for w, count in tii.items():
        # Classes in which w has 0 occurrences
        z = [zeroo for zeroo in C if zeroo not in count]
        for ling in z:
            # Denominator uses |V| so it matches line 10 of the pseudocode
            gcp[w][ling] = 1.0 / (len(V) + tnw[ling])
        for ling, val in count.items():
            gcp[w][ling] = float(val + 1) / (len(V) + tnw[ling])

    # Prior
    for c in C:
        prior[c] = cdic[c] / float(N)
    return V, prior, gcp

Answer 1

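For question 1

For the vocabulary: when you send the data to the classifier, also send every word you encounter under one common label. For example, if your model has entries like this:
l = label1, W = word1 count
l = label1, W = word2 count
.
l = label2, W = word3 count
.
l = label3, W = word1 count

and so on, also add entries like the following: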

Vocab,word1 count

Vocab,word2 count

Vocab,word3 count

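Here word1, word2, word3 are all the words encountered in the training documents, each listed only once. Store them in a hashmap and flush it.

Then, in the classifier, add 1 every time you encounter "Vocab"; the total is the vocabulary size.

Do the same for documents: keep a separate counter and increment it whenever a new document is encountered.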
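The counting scheme described above could be sketched roughly as follows. This is only an illustration: the post never shows the real document format, so the `(label, tokens)` layout and the names `count_vocab_and_docs`, `word_counts` and `doc_counts` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def count_vocab_and_docs(documents):
    """Collect per-label word counts, a shared "Vocab" entry, and document counts.

    Assumption (not from the original post): each document is a
    (label, tokens) pair, and no real class is itself named "Vocab".
    """
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for label, tokens in documents:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        # Also file every word under one common "Vocab" label.
        word_counts["Vocab"].update(tokens)
    # Each distinct word under "Vocab" adds 1 to the vocabulary size.
    vocab_size = len(word_counts["Vocab"])
    return word_counts, doc_counts, vocab_size
```

Calling it on a list such as `[("spam", ["buy", "now", "buy"]), ("ham", ["see", "you", "now"])]` gives the per-label counts, the number of documents per label, and the vocabulary size in one pass over the data.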

For question 2

Are you considering all the classes?
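One possible reading of that hint: if every training document carries its own label, C does not have to be passed in, because it can be collected from the documents themselves. The sketch below follows the pseudocode from the question with add-one smoothing under that assumption. The `(label, tokens)` layout and the name `train_multinomial_nb` are hypothetical and do not come from the original post.

```python
from collections import Counter, defaultdict


def train_multinomial_nb(documents):
    """Train multinomial Naive Bayes with add-one (Laplace) smoothing.

    Assumption: documents is a list of (label, tokens) pairs.
    Returns (V, prior, condprob) where condprob[t][c] estimates P(t | c).
    """
    n_docs = len(documents)
    docs_in_class = Counter()           # Nc for each class c
    term_counts = defaultdict(Counter)  # class -> term -> Tct
    for label, tokens in documents:
        docs_in_class[label] += 1
        term_counts[label].update(tokens)

    # C is derived from the labels actually seen in the training data.
    classes = sorted(docs_in_class)
    vocab = sorted({t for counts in term_counts.values() for t in counts})

    prior = {c: docs_in_class[c] / n_docs for c in classes}
    condprob = defaultdict(dict)
    for c in classes:
        # sum over t' of (Tct' + 1) = tokens in class c + |V|
        denom = sum(term_counts[c].values()) + len(vocab)
        for t in vocab:
            condprob[t][c] = (term_counts[c][t] + 1) / denom
    return vocab, prior, condprob
```

Because the denominator adds the full vocabulary size, the conditional probabilities for each class sum to 1 over the vocabulary, and a term that never occurs in a class still gets a small nonzero probability.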

Solution 1
0 2015-09-15 21:04:59
