
Python - NLTK Bigram Keep <s> and </s> as one word

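I am trying to write a program that calculates bigram probabilities. My first step is to work out the word pairs in a sentence.

Each of these sentences starts with <s> and ends with </s>. So if my example sentence is <s> my name is python </s>, my result should be the following (I include the p labels because I can work out the probabilities afterwards):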

p(my | <s>)
p(name | my)
p(is | name)
p(python | is)
p(</s> | python)

Instead, I get a result like this:

Counter({('<', 's'): 1, ('s', '>'): 1, ('>', 'my'): 1, ('my', 'name'): 1, ('name', 'is'): 1, ('is', 'python'): 1, ('python', '<'): 1, ('<', '/s'): 1, ('/s', '>'): 1})

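How can I keep <s> and </s> as single tokens instead of having them split apart?

My code is:

import nltk
from collections import Counter
from nltk.util import ngrams

text = "<s> my name is python </s>"
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)

print(Counter(bigrams))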

Edit

Say I have a text file:

<s> a a b b c c </s> <s> a c b c </s> <s> b c c a b </s>

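I then open this text file, do the following to each line, and store the result in a list.

# Raw string so that \s is passed to the regex engine unchanged
temp = re.split(r"\s+", line.rstrip('\n'))
bigramText.append(temp)

So now my list contains: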

[['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
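
For reference, the reading step described above could look roughly like the sketch below. It assumes the file holds one sentence per line (which is what the nested list suggests), and it uses an in-memory `io.StringIO` as a stand-in for the real file; the names `sample_file` and `bigram_text` are illustrative only.

```python
import io
import re

# Stand-in for the text file; assumes one sentence per line.
sample_file = io.StringIO(
    "<s> a a b b c c </s>\n"
    "<s> a c b c </s>\n"
    "<s> b c c a b </s>\n"
)

bigram_text = []
for line in sample_file:
    stripped = line.strip()
    if stripped:  # skip blank lines
        # Split on any run of whitespace so <s> and </s> stay whole
        bigram_text.append(re.split(r"\s+", stripped))

print(bigram_text)
```

With a real file, `sample_file` would be replaced by the object returned from `open(...)`.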

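From this stage I want to do the calculation that gives the bigram probabilities. I don't know whether my original question helps you get to that result, but essentially I am trying to find out how many times each of these combinations occurs, i.e. how many times one letter appears right after another.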

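If you can split on whitespace, you should probably write your own bigrammizer. (Being able to split on whitespace is usually a strong condition; for example, I would rather keep a contraction such as I'm as one token than have it split.)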

def custom_bigrams(l):
    return list(zip(l, l[1:]))
print(custom_bigrams(['<s>', 'my', 'name', 'is', 'python', '</s>']))

It prints

[('<s>', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'python'), ('python', '</s>')]

To use it on your list, compute the bigrams of each sublist and then use the update method of Counter.

from collections import Counter

your_list = [['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]

c = Counter()
for x in your_list:
    c.update(custom_bigrams(x))
print(c)

This yields

Counter({('b', 'c'): 3, ('<s>', 'a'): 2, ('a', 'b'): 2, ('c', 'c'): 2, ('c', '</s>'): 2, ('a', 'a'): 1, ('b', 'b'): 1, ('a', 'c'): 1, ('c', 'b'): 1, ('<s>', 'b'): 1, ('c', 'a'): 1, ('b', '</s>'): 1})
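
Since the stated goal is the bigram probabilities, here is a minimal sketch of one way to get from these counts to p(word | previous). It assumes the plain unsmoothed estimate, count(previous, word) / count(previous as the first element of a bigram), which the question does not spell out; the names `context_counts` and `bigram_probability` are made up for this example.

```python
from collections import Counter

# Tokenized sentences from the example above
sentences = [
    ['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'],
    ['<s>', 'a', 'c', 'b', 'c', '</s>'],
    ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>'],
]

bigram_counts = Counter()
context_counts = Counter()  # how often each token is the first element of a bigram
for tokens in sentences:
    pairs = list(zip(tokens, tokens[1:]))
    bigram_counts.update(pairs)
    context_counts.update(first for first, _ in pairs)

def bigram_probability(word, previous):
    """Unsmoothed estimate of p(word | previous); 0.0 if previous never starts a bigram."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]

print(bigram_probability('a', '<s>'))  # 2 of the 3 sentences start with 'a'
print(bigram_probability('c', 'b'))    # ('b', 'c') occurs 3 times, 'b' starts 5 bigrams
```

Pairs that never occur get probability 0 under this estimate, so some form of smoothing may be needed if unseen bigrams matter.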

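The NLTK tokenizer gets '<s>' and '</s>' wrong when it segments the text. You should remove them before calling the tokenizer and add them back after tokenization; this is a normal thing to do.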

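import nltk
from nltk.util import ngrams

text = "<s> my name is python </s>"
# Strip the markers before tokenizing, then re-attach them as single tokens
clean_text = text.replace('<s>', '').replace('</s>', '')
token = ['<s>'] + nltk.word_tokenize(clean_text) + ['</s>']
bigrams = ngrams(token, 2)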
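
If the text contains several sentences, or NLTK is not a hard requirement, another option is a small regular-expression tokenizer that treats the markers as whole tokens. The sketch below is only an illustration, not part of either answer: the pattern and the name `tokenize_keeping_markers` are assumptions, and it simply drops punctuation, which may or may not be what you want.

```python
import re
from collections import Counter

# Match a sentence marker (<s> or </s>) as one token, otherwise a run of word characters.
TOKEN_PATTERN = re.compile(r"</?s>|\w+")

def tokenize_keeping_markers(text):
    """Return the tokens of text with <s> and </s> kept intact."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_keeping_markers("<s> my name is python </s>")
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(tokens)
print(bigram_counts)
```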

