How to remove List special characters (“()”, “'”, “,”) from the output of a bi/tri-gram in Python
I have written code that computes bigram/trigram frequencies from a text input using NLTK. The problem I'm facing is that since the output is obtained as a Python list, my output contains the list-specific characters, i.e. ("()", "'", ","). I plan to export this to a csv file, so I would like to remove these special characters in the code itself. How do I edit the code to do this?

Input code:
import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
from nltk.corpus import stopwords
corpus = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
s_corpus = corpus.lower()
stop_words = set(stopwords.words('english'))
tokens = nltk.word_tokenize(s_corpus)
tokens = [word for word in tokens if word not in stop_words]
c_tokens = [''.join(e for e in string if e.isalnum()) for string in tokens]
c_tokens = [x for x in c_tokens if x]
bgs_2 = nltk.bigrams(c_tokens)
bgs_3 = nltk.trigrams(c_tokens)
fdist = nltk.FreqDist(bgs_3)
tmp = list()
for k, v in fdist.items():
    tmp.append((v, k))
tmp = sorted(tmp, reverse=True)
for kk, vv in tmp:
    print(vv, kk)
Current output:
('looked', 'far', 'looked') 3
('far', 'looked', 'far') 3
('visual', 'held', 'memory') 2
('returned', 'waking', 'nurse') 2
Expected output:
looked far looked, 3
far looked far, 3
visual held memory, 2
returned waking nurse, 2
Thanks for your help.
A better question to ask would be: what are those ("()", "'", ",") in the ngrams output?
>>> from nltk import ngrams
>>> from nltk import word_tokenize
# Split a sentence into a list of "words"
>>> word_tokenize("This is a foo bar sentence")
['This', 'is', 'a', 'foo', 'bar', 'sentence']
>>> type(word_tokenize("This is a foo bar sentence"))
<class 'list'>
# Extract bigrams.
>>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))
[('This', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'sentence')]
# Okay, so the output is a list, no surprise.
>>> type(list(ngrams(word_tokenize("This is a foo bar sentence"), 2)))
<class 'list'>
But what type is ('This', 'is')?
>>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
('This', 'is')
>>> first_thing_in_output = list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
>>> type(first_thing_in_output)
<class 'tuple'>
Ah, it's a tuple, see https://realpython.com/python-lists-tuples/

What happens when you print a tuple?
>>> print(first_thing_in_output)
('This', 'is')
What happens if you convert them with str()?
>>> print(str(first_thing_in_output))
('This', 'is')
But I want the output This is instead of ('This', 'is'), so I'll use the str.join() function, see https://www.geeksforgeeks.org/join-function-python/ :
>>> print(' '.join((first_thing_in_output)))
This is
Now this is a good time to really go through a basic Python types tutorial to understand what's happening. Additionally, it would be good to understand how "container" types work too, e.g. https://github.com/usaarhat/pywarmups/blob/master/session2.md
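Applied to the original problem, the same str.join() trick can be mapped over every ngram tuple to turn the whole list into plain strings (a minimal stdlib-only sketch; the sample bigram list is made up):

```python
# Hypothetical sample of what list(ngrams(..., 2)) returns: a list of tuples.
bigrams = [('This', 'is'), ('is', 'a'), ('a', 'foo')]

# Join each tuple's words with a space, so no list/tuple characters remain.
flat = [' '.join(bg) for bg in bigrams]
print(flat)  # → ['This is', 'is a', 'a foo']
```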
In the original post, there were quite a few problems with the code.

I'm guessing the goal of the code is to get the bigram/trigram counts from the text after removing stopwords.

The tricky part is that stopwords.words('english') doesn't contain punctuation, so you end up with weird ngrams that contain punctuation:
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
stoplist = set(stopwords.words('english'))
tokens = [token for token in word_tokenize(text) if token not in stoplist]
list(ngrams(tokens, 2))
[OUT]:
[('The', 'pure'),
('pure', 'amnesia'),
('amnesia', 'face'),
('face', ','),
(',', 'newborn'),
('newborn', '.'),
('.', 'I'),
('I', 'looked'),
('looked', 'far'),
('far', ','),
(',', ','), ...]
Perhaps you want to extend the stoplist with punctuation, e.g.
from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
stoplist = set(stopwords.words('english') + list(punctuation))
tokens = [token for token in word_tokenize(text) if token not in stoplist]
list(ngrams(tokens, 2))
[OUT]:
[('The', 'pure'),
('pure', 'amnesia'),
('amnesia', 'face'),
('face', 'newborn'),
('newborn', 'I'),
('I', 'looked'),
('looked', 'far'),
('far', 'looked'),
('looked', 'far'), ...]
然后,您意識到像I
這樣的標記應該是一個停用詞,但仍然存在於您的ngram列表中。 這是因為stopwords.words('english')
中的列表是小寫的,例如
>>> stopwords.words('english')
[OUT]:
['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're", ...]
So when you check whether a token is in the stoplist, you should also lowercase the token. (Avoid lowercasing the sentence before word_tokenize, because word_tokenize might take hints from capitalization.) Thus:
from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
stoplist = set(stopwords.words('english') + list(punctuation))
tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]
list(ngrams(tokens, 2))
[OUT]:
[('pure', 'amnesia'),
('amnesia', 'face'),
('face', 'newborn'),
('newborn', 'looked'),
('looked', 'far'),
('far', 'looked'),
('looked', 'far'),
('far', 'looked'),
('looked', 'far'),
('far', 'looked'), ...]
Now the ngrams look like they're achieving the goal.

Then, for the last part, where you want to output the ngrams to a file in sorted order, you can actually use FreqDist.most_common(), which lists them in descending order, e.g.
from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk import FreqDist
text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
stoplist = set(stopwords.words('english') + list(punctuation))
tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]
FreqDist(ngrams(tokens, 2)).most_common()
[OUT]:
[(('looked', 'far'), 4),
(('far', 'looked'), 3),
(('visual', 'held'), 2),
(('held', 'memory'), 2),
(('memory', 'Little'), 2),
(('Little', 'little'), 2),
(('little', 'returned'), 2),
(('returned', 'waking'), 2),
(('waking', 'nurse'), 2),
(('pure', 'amnesia'), 1),
(('amnesia', 'face'), 1),
(('face', 'newborn'), 1),
(('newborn', 'looked'), 1),
(('far', 'visual'), 1),
(('nurse', 'visual'), 1)]
(See also: Difference between Python's collections.Counter and nltk.probability.FreqDist)
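FreqDist is built on top of collections.Counter, so the descending-order behaviour of most_common() can be sketched with the stdlib alone (the bigram list here is a made-up sample):

```python
from collections import Counter

# Hypothetical bigram tuples, shaped like nltk.ngrams output.
bigrams = [('looked', 'far'), ('far', 'looked'), ('looked', 'far'),
           ('visual', 'held'), ('looked', 'far')]

# most_common() yields (item, count) pairs, highest count first.
for bg, count in Counter(bigrams).most_common():
    print(' '.join(bg), count)
# → looked far 3
#   far looked 1
#   visual held 1
```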
最后,最后將其打印到文件中,您應該真正使用上下文管理器http://eigenhombre.com/introduction-to-context-managers-in-python.html
with open('bigrams-list.tsv', 'w') as fout:
    for bg, count in FreqDist(ngrams(tokens, 2)).most_common():
        print('\t'.join([' '.join(bg), str(count)]), end='\n', file=fout)
[bigrams-list.tsv]:
looked far 4
far looked 3
visual held 2
held memory 2
memory Little 2
Little little 2
little returned 2
returned waking 2
waking nurse 2
pure amnesia 1
amnesia face 1
face newborn 1
newborn looked 1
far visual 1
nurse visual 1
Now you see this strange bigram Little little, does it make sense?

It's a by-product of removing by from little by little.

So now, depending on what the final task with your extracted ngrams is, you might not really want to remove stopwords from the list.
So, just to "fix" your output, use this to print the data. Note that in your loop each entry of tmp is (count, ngram), so kk is the count and vv is the ngram tuple:

for kk, vv in tmp:
    print(" ".join(vv) + ", %d" % kk)
However, if you want to parse it into a csv, you should collect the output in a different format. Currently, you are creating a list of tuples, each containing a tuple and a number. Try collecting the data as a list of lists containing each value instead. That way you can write it directly to a csv file.

Take a look here: Create a .csv file with values from a Python list
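As a sketch of that list-of-lists idea using the stdlib csv module (the sample data and the filename trigram-counts.csv are made up for illustration):

```python
import csv

# Hypothetical (count, ngram) pairs, shaped like the tmp list above.
tmp = [(3, ('looked', 'far', 'looked')), (2, ('visual', 'held', 'memory'))]

# Reshape into a list of rows: [ngram-as-string, count].
rows = [[' '.join(ngram), count] for count, ngram in tmp]

# csv.writer turns each row into one comma-separated line.
with open('trigram-counts.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['ngram', 'count'])  # header row
    writer.writerows(rows)
```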