NLTK Perplexity measure inversion

I have been given a train text and a test text. What I want to do is train a language model on the train data and use it to calculate the perplexity of the test data.

This is my code:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends

from nltk import word_tokenize, sent_tokenize 

fileTest = open("AaronPressman.txt", "r")
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len=n, max_len=n)

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)


model = Laplace(n) 
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest)) 

When I run this code with n=1 (unigram), I get "1068.332393940235". With n=2 (bigram) I get "1644.3441077259993", and with trigrams I get "2552.2085752565313".

What is the problem with it?

The way you are creating the test data is wrong: the train data is lower-cased but the test data is not, and the start/end padding tokens are missing from the test data. Try this:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize 

"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(textTest)]
# Keep only the everygrams here; reusing the name padded_sents would overwrite
# the training vocabulary that fit() needs below.
test_data, _ = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(n)  # model order matches the n used for the pipelines above
model.fit(train_data, padded_sents)

s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p

print ("Perplexity: {0}".format(s/(i+1)))
