How to create the bigram matrix?
I want to make a matrix for a bigram model. How can I do it? Any suggestions that fit my code, please?
import codecs
from collections import Counter

import nltk

# Note: token is reassigned on every pass, so after the loop it holds
# only the words of the last line in the file.
with codecs.open("Pezeshki339.txt", 'r', 'utf8') as file:
    for line in file:
        token = line.split()

# 80/20 train/test split
spl = 80 * len(token) / 100
train = token[:int(spl)]
test = token[int(spl):]
print(len(test))
print(len(train))

cn = Counter(train)
known_words = [word for word, v in cn.items() if v > 1]  # removes the rare words and puts them in a list
bigram = nltk.bigrams(known_words)
frequency = nltk.FreqDist(bigram)
for f in frequency:
    print(f, frequency[f])
I need something like:
      w1        w2        w3        ...   wn
w1    n(w1w1)   n(w1w2)   n(w1w3)   ...   n(w1wn)
w2    n(w2w1)   n(w2w2)   n(w2w3)   ...   n(w2wn)
w3    .
.
.
.
wn
The same for all rows and columns.
Since you need a "matrix" of words, you'll use a dictionary-like class. You want a dictionary keyed by the first word of each bigram. To make the matrix two-dimensional, it will be a dictionary of dictionaries: each value is another dictionary, whose keys are the second words of the bigrams and whose values are whatever you're tracking (probably the number of occurrences).
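A minimal sketch of that dictionary of dictionaries, using only the standard library. The function name bigram_counts and the toy sentence are made up for illustration; they are not part of your code.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Return {first_word: Counter({second_word: count})} for adjacent token pairs."""
    table = defaultdict(Counter)
    # zip(tokens, tokens[1:]) pairs every token with the one that follows it.
    for first, second in zip(tokens, tokens[1:]):
        table[first][second] += 1
    return table


# Toy input (an assumption for the example, not data from the question).
tokens = "the cat sat on the mat the cat ran".split()
table = bigram_counts(tokens)

print(table["the"]["cat"])  # "the cat" occurs twice
print(table["the"]["mat"])  # "the mat" occurs once
print(table["cat"]["the"])  # a pair that never occurs counts as 0
```

The outer key is the row of your matrix (the first word) and the inner key is the column (the second word), so table[w1][w2] plays the role of n(w1w2).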
In the NLTK you can do it quickly with a ConditionalFreqDist():
from nltk.corpus import brown  # requires the Brown corpus: nltk.download("brown")
mybigrams = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))
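To see what the resulting object looks like without downloading a corpus, here is a small sketch on a toy token list (the sentence is an assumption for the example; substitute your own train tokens). Indexing by the first word and then the second gives the count, and tabulate() prints the counts as a table.

```python
import nltk

# Toy tokens standing in for a real corpus.
tokens = "the cat sat on the mat the cat ran".split()

# Each bigram (w1, w2) is treated as a (condition, sample) pair.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

print(cfd["the"]["cat"])  # number of times "cat" follows "the"

# Print the full square table: rows are first words, columns are second words.
words = sorted(set(tokens))
cfd.tabulate(conditions=words, samples=words)
```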
But I recommend you build your bigram table step by step. You'll understand it better, and you need to understand it before you can use it.
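One possible way to do it step by step, as a sketch rather than the definitive method. It assumes the counts should come from adjacent words in the training tokens (not from the deduplicated known_words list), and that "rare" means seen only once, as in your v>1 filter. The names bigram_matrix, format_matrix and min_count are hypothetical.

```python
from collections import Counter


def bigram_matrix(tokens, min_count=2):
    """Build a square count matrix over words seen at least min_count times."""
    # Step 1: pick the vocabulary, dropping rare words.
    counts = Counter(tokens)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}

    # Step 2: start from an n x n matrix of zeros.
    matrix = [[0] * len(vocab) for _ in vocab]

    # Step 3: walk over adjacent pairs in the original token order and
    # count only those where both words are in the vocabulary.
    for first, second in zip(tokens, tokens[1:]):
        if first in index and second in index:
            matrix[index[first]][index[second]] += 1
    return vocab, matrix


def format_matrix(vocab, matrix):
    """Render the matrix with the vocabulary as row and column labels."""
    cells = [len(w) for w in vocab] + [len(str(n)) for row in matrix for n in row]
    width = max(cells + [1]) + 2
    lines = ["".rjust(width) + "".join(w.rjust(width) for w in vocab)]
    for w, row in zip(vocab, matrix):
        lines.append(w.rjust(width) + "".join(str(n).rjust(width) for n in row))
    return "\n".join(lines)


# Toy input (an assumption for the example): "c" occurs once and is dropped.
tokens = "a b a b c a b".split()
vocab, matrix = bigram_matrix(tokens)
print(format_matrix(vocab, matrix))
```

A list of lists keeps every cell, including zeros, so it matches the w1..wn layout you drew; for a large vocabulary the dictionary-of-dictionaries form is far more compact because most pairs never occur.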