Python - NLTK Bigram Keep <s> and </s> as one word
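I am trying to write a program that computes bigram probabilities. My first step is to work out the bigrams of a sentence.
Each sentence begins with <s> and ends with </s>. Say my example sentence is <s> my name is python </s>. My result should then be the following (I have written them with p labels because I can work out the probabilities afterwards):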
p(my | <s>)
p(name | my)
p(is | name)
p(python | is)
p(</s> | python)
But instead I get the following result:
Counter({('<', 's'): 1, ('s', '>'): 1, ('>', 'my'): 1, ('my', 'name'): 1, ('name', 'is'): 1, ('is', 'python'): 1, ('python', '<'): 1, ('<', '/s'): 1, ('/s', '>'): 1})
How can I keep <s> and </s> as single tokens instead of having them split apart?
My code is:
import nltk
from nltk.util import ngrams
from collections import Counter
text = "<s> my name is python </s>"
token = nltk.word_tokenize(text)
bigrams = ngrams(token,2)
print(Counter(bigrams))
Edit
Say I have a text file:
<s> a a b b c c </s>
<s> a c b c </s>
<s> b c c a b </s>
I then open this text file, do the following to each line, and store the result in a list.
temp = re.split(r"\s+", line.rstrip('\n'))  # raw string so \s is passed to the regex engine unchanged
bigramText.append(temp)
So now my list contains:
[['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
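From this stage on, I want to do the calculations that give the bigram probabilities. I don't know whether my original question helps toward that result, but essentially I am trying to find out how many times each of these combinations occurs, i.e. how many times one letter appears next to another.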
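If you can split on whitespace, you should probably write your own bigram function (being able to split on whitespace alone is usually a tough condition to meet, though).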
def custom_bigrams(l):
    return list(zip(l, l[1:]))
print(custom_bigrams(['<s>', 'my', 'name', 'is', 'python', '</s>']))
This prints:
[('<s>', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'python'), ('python', '</s>')]
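The same zip idea extends beyond bigrams. Here is a minimal sketch, assuming a hypothetical helper named custom_ngrams (it is not part of NLTK or of the answer above) that takes an already-split token list:

```python
def custom_ngrams(tokens, n):
    """Return all n-grams of an already-tokenized sentence as tuples."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['<s>', 'my', 'name', 'is', 'python', '</s>']
bigrams = custom_ngrams(tokens, 2)
trigrams = custom_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

With n=2 this should give the same pairs as custom_bigrams, so it can stand in for it if you later need trigrams as well.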
To use it on your list of lists, compute the bigrams of each sentence and pass them to the update method of Counter.
your_list = [['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
from collections import Counter
c = Counter()
for x in your_list:
    c.update(custom_bigrams(x))
which yields
Counter({('b', 'c'): 3, ('<s>', 'a'): 2, ('a', 'b'): 2, ('c', 'c'): 2, ('c', '</s>'): 2, ('a', 'a'): 1, ('b', 'b'): 1, ('a', 'c'): 1, ('c', 'b'): 1, ('<s>', 'b'): 1, ('c', 'a'): 1, ('b', '</s>'): 1})
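From counts like these, the bigram probabilities the question asks for can be estimated as count(w1, w2) / count(w1 as the first word of a bigram). The following is only a sketch of that unsmoothed estimate. The names bigram_counts, context_counts and bigram_prob are made up for illustration, and unseen pairs simply get probability 0:

```python
from collections import Counter

sentences = [
    ['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'],
    ['<s>', 'a', 'c', 'b', 'c', '</s>'],
    ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>'],
]

bigram_counts = Counter()
context_counts = Counter()
for sent in sentences:
    pairs = list(zip(sent, sent[1:]))
    bigram_counts.update(pairs)
    # count each word once per time it appears as the first element of a bigram
    context_counts.update(w1 for w1, _ in pairs)

def bigram_prob(w2, w1):
    """Unsmoothed estimate of p(w2 | w1); 0.0 if w1 was never seen as a context."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]

print(bigram_prob('a', '<s>'))  # p(a | <s>)
print(bigram_prob('c', 'b'))    # p(c | b)
```

Counting contexts from the bigrams themselves, rather than from all tokens, keeps </s> out of the denominators, so the probabilities for each context word sum to 1.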
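NLTK tokenizers split '<s>' and '</s>' incorrectly. You should remove them before calling the tokenizer and add them back after tokenization; this is a normal thing to do.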
import nltk
from nltk.util import ngrams
text = "<s> my name is python </s>"
clean_text = text.replace('<s>','').replace('</s>','')  # hide the markers from the tokenizer
token = ['<s>'] + nltk.word_tokenize(clean_text) + ['</s>']
bigrams = ngrams(token,2)
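If you would rather tokenize in one pass, another option is a small regular expression that matches the markers before anything else. This is only a sketch and not NLTK's tokenizer: the pattern and the name tokenize_keep_markers are assumptions made for illustration, and it does not handle contractions or other cases that nltk.word_tokenize treats specially.

```python
import re
from collections import Counter

# Match the sentence markers first, then runs of word characters,
# then any single leftover non-space symbol (punctuation).
TOKEN_RE = re.compile(r"</?s>|\w+|[^\w\s]")

def tokenize_keep_markers(text):
    """Split text into tokens, keeping <s> and </s> intact."""
    return TOKEN_RE.findall(text)

tokens = tokenize_keep_markers("<s> my name is python </s>")
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(tokens)
print(bigram_counts)
```

Because the marker alternative comes first in the pattern, '<' and '>' are only emitted as separate tokens when they are not part of <s> or </s>.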