How do I divide bigram frequencies by unigram word frequencies?

Question

Below is my code.

from __future__ import division
import nltk
import re

f = open('C:/Python27/brown_A1_half.txt', 'rU')
w = open('C:/Python27/brown_A1_half_Out.txt', 'w')

#to read whole file using read()

filecontents = f.read()
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(filecontents)

for sentence in sent_tokenize_list:
    sentence = "Start " + sentence + " End"
    tokens = sentence.split()
    bigrams = (tuple(nltk.bigrams(tokens)))
    bigrams_frequency = nltk.FreqDist(bigrams)
    for k,v in bigrams_frequency.items():
        print k, v

The output is each bigram followed by its frequency. What I want is, for each bigram, to divide the bigram's frequency by the unigram frequency of its first word. (For example, if there is a bigram ('red', 'apple') with frequency 3, I want to divide 3 by the frequency of 'red'.) This is to obtain the MLE probability, that is, MLE prob = Count(w1, w2) / Count(w1). Please help.
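Written as a formula, the estimate described above is the usual maximum-likelihood bigram estimate. The notation C(·) for a count is introduced here only for illustration:

```latex
P_{\mathrm{MLE}}(w_2 \mid w_1) = \frac{C(w_1, w_2)}{C(w_1)}
```

For instance, with C(red, apple) = 3 and, purely as an assumed value, C(red) = 6, the estimate would be 3 / 6 = 0.5.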

Answer 1

You can add the following in the for loop (after print k, v):

number_unigrams = tokens.count(k[0])
prob = v / number_unigrams

That should give you the MLE probability for each bigram. Because tokens holds only the current sentence, both counts are taken per sentence.
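If the counts should instead be pooled over all sentences in the file, one possible approach is sketched below. It is only a sketch under stated assumptions: it uses collections.Counter and zip from the standard library in place of nltk.FreqDist and nltk.bigrams so that it runs without NLTK, the function name mle_bigram_probs is hypothetical, and the three sample sentences are made up for illustration.

```python
from __future__ import division  # true division on Python 2 as well

from collections import Counter


def mle_bigram_probs(sentences):
    """Return {(w1, w2): Count(w1, w2) / Count(w1)}, pooled over all sentences.

    Hypothetical helper for illustration; not part of NLTK.
    """
    bigram_counts = Counter()
    unigram_counts = Counter()
    for sentence in sentences:
        # Same padding and whitespace tokenization as in the question.
        tokens = ("Start " + sentence + " End").split()
        unigram_counts.update(tokens)
        # Adjacent token pairs, equivalent to the bigrams of the sentence.
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        bigram: count / unigram_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }


# Made-up sample input standing in for the sentence list from the file.
probs = mle_bigram_probs(["red apple", "red apple", "red car"])
```

Here probs[("red", "apple")] is 2/3, because "red" occurs three times and is followed by "apple" twice. With the real file, the list returned by sent_tokenize(filecontents) could be passed in instead of the sample list.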

Answer 1
0 Accepted 2016-04-22 05:00:13
