How can I divide the frequency of a bigram pair by the frequency of its first unigram word?
Below is my code.
from __future__ import division
import nltk
from nltk.tokenize import sent_tokenize

f = open('C:/Python27/brown_A1_half.txt', 'rU')
w = open('C:/Python27/brown_A1_half_Out.txt', 'w')

# read the whole file at once using read()
filecontents = f.read()

sent_tokenize_list = sent_tokenize(filecontents)
for sentence in sent_tokenize_list:
    sentence = "Start " + sentence + " End"
    tokens = sentence.split()
    bigrams = tuple(nltk.bigrams(tokens))
    bigrams_frequency = nltk.FreqDist(bigrams)
    for k, v in bigrams_frequency.items():
        print k, v
The printed result is "(bigram), its frequency". What I want is, for each bigram pair, to divide the bigram frequency by the frequency of its first unigram word. (For example, if there is a bigram ('red', 'apple') with frequency 3, I want to divide that by the frequency of 'red'.) This is for obtaining the MLE probability, that is, MLE prob = count(w1, w2) / count(w1). Please help.
You can add the following inside the for loop (after the print k, v line):
number_unigrams = tokens.count(k[0])
prob = v / number_unigrams
That should give you the MLE probability for each bigram.
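For reference, the same MLE computation (count(w1, w2) / count(w1)) can be written as a small self-contained sketch using only collections.Counter, with no NLTK dependency. The function name and the toy token list below are illustrative, not part of the original code:

    from __future__ import division  # integer counts divide to floats on Python 2
    from collections import Counter

    def bigram_mle(tokens):
        """Return {(w1, w2): count(w1, w2) / count(w1)} for a list of tokens."""
        unigram_counts = Counter(tokens)
        # pair each token with its successor to form the bigrams
        bigram_counts = Counter(zip(tokens, tokens[1:]))
        return {(w1, w2): c / unigram_counts[w1]
                for (w1, w2), c in bigram_counts.items()}

    # toy example: 'red' occurs twice, the bigram ('red', 'apple') once,
    # so its MLE probability is 1 / 2 = 0.5
    tokens = "Start red apple Start red car End".split()
    probs = bigram_mle(tokens)

Note this divides by the full unigram count of w1, exactly as tokens.count(k[0]) does in the answer above.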