How can I divide the frequency of a bigram pair by the frequency of its first (unigram) word?

Below is my code.

from __future__ import division
import nltk
import re

f = open('C:/Python27/brown_A1_half.txt', 'rU')
w = open('C:/Python27/brown_A1_half_Out.txt', 'w')

#to read whole file using read()

filecontents = f.read()
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(filecontents)

for sentence in sent_tokenize_list:
    sentence = "Start " + sentence + " End"
    tokens = sentence.split()
    bigrams = (tuple(nltk.bigrams(tokens)))
    bigrams_frequency = nltk.FreqDist(bigrams)
    for k,v in bigrams_frequency.items():
        print k, v 

The printed result is each bigram followed by its frequency. What I want is, for each bigram pair, to divide the bigram frequency by the frequency of its first word. For example, if there is a bigram ('red', 'apple') with frequency 3, I want to divide that 3 by the frequency of 'red'. This is to obtain the MLE probability, that is, MLE prob = Count(w1, w2) / Count(w1). Any help would be appreciated.

You can add the following inside the inner for loop (after print k, v):

number_unigrams = tokens.count(k[0])
prob = v / number_unigrams

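That should give you the MLE probability for each bigram. Note that tokens holds only the current sentence, so both counts are per sentence, just like the bigram frequencies in your loop.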
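
If it helps, here is a minimal self-contained sketch of the same calculation. It is not the code from the question: it assumes Python 3, swaps nltk.bigrams and nltk.FreqDist for collections.Counter from the standard library, and uses a made-up example sentence. The function name bigram_mle is hypothetical. Counting the unigrams once up front avoids calling tokens.count() for every bigram.

```python
from collections import Counter


def bigram_mle(tokens):
    """Return {(w1, w2): Count(w1, w2) / Count(w1)} for a list of tokens.

    bigram_mle is a hypothetical helper name used only for this sketch.
    """
    # Count every word once instead of calling tokens.count() per bigram.
    unigram_counts = Counter(tokens)
    # Pair each token with its successor to form the bigrams.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): count / unigram_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


# Made-up example sentence with the Start/End markers from the question.
tokens = "Start the red apple and the red car End".split()
probs = bigram_mle(tokens)
for bigram, prob in sorted(probs.items()):
    print(bigram, prob)
```

Passing the tokens of a single sentence gives per-sentence estimates, as in the loop above. Passing the tokens of the whole file would give corpus-level estimates instead, although bigrams would then also span sentence boundaries unless each sentence is counted separately and the counts are summed.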
