Need help understanding this Python Viterbi algorithm
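I'm trying to convert a Python implementation of the Viterbi algorithm, found in this Stack Overflow answer, into Ruby. The full script, with my comments, is at the bottom of this question.
Unfortunately I know very little Python, so the translation is proving much harder than I'd like. Still, I have made some progress. Right now, the only line that is completely melting my brain is this one: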
prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
Could someone explain what it is doing?
Here is the full Python script:
import re
from itertools import groupby

# text will be a compound word such as 'wickedweather'.
def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    # Iterate over the letters in the compound.
    # eg. [w, ickedweather], [wi, ckedweather], and so on.
    for i in range(1, len(text) + 1):
        # I've no idea what this line is doing and I can't figure out how to split it up?
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
        # Append values to arrays.
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

# Calc the probability of a word based on occurrences in the dictionary.
def word_prob(word):
    # dictionary.get(key) will return the value for the specified key.
    # In this case, that's the number of occurrences of the word in the
    # dictionary. The second argument is a default value to return if
    # the word is not found.
    return dictionary.get(word, 0) / total

# This ensures we only deal with full words rather than each
# individual letter. Normalize the words basically.
def words(text):
    return re.findall('[a-z]+', text.lower())

# This gets us a hash where the keys are words and the values are the
# number of occurrences in the dictionary.
# /usr/share/dict/words is a file of newline-delimited words.
dictionary = dict((w, len(list(ws)))
                  for w, ws in groupby(sorted(words(open('/usr/share/dict/words').read()))))

# Assign the length of the longest word in the dictionary.
max_word_length = max(map(len, dictionary))

# Assign the total number of words in the dictionary. It's a float
# because we're going to divide by it later on.
total = float(sum(dictionary.values()))

# Run the algo over a file of newline-delimited compound words.
compounds = words(open('compounds.txt').read())
for comp in compounds:
    print(comp, ": ", viterbi_segment(comp))
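You are looking at a list comprehension (strictly speaking, a generator expression passed to max).
The expanded version looks like this: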
all_probs = []
for j in range(max(0, i - max_word_length), i):
    all_probs.append((probs[j] * word_prob(text[j:i]), j))
prob_k, k = max(all_probs)
I hope this helps explain it. If not, feel free to edit your question and point to the statements you don't understand.
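To see why max returns both a probability and an index, here is a minimal sketch of one iteration. The values of text, i, probs and the toy word probabilities are assumptions made up for illustration; they are not taken from the real dictionary.

```python
# Toy stand-ins (assumptions for illustration, not real dictionary data).
text = "ab"
i = 2
max_word_length = 2
probs = [1.0, 0.5]                      # best probabilities for text[:0] and text[:1]
toy_word_prob = {"ab": 0.2, "b": 0.3}   # hypothetical word probabilities


def word_prob(word):
    return toy_word_prob.get(word, 0.0)


# One (probability, start_index) tuple per candidate last word text[j:i].
candidates = [(probs[j] * word_prob(text[j:i]), j)
              for j in range(max(0, i - max_word_length), i)]

# Tuples compare element by element, so max() picks the highest
# probability and only falls back to the larger j to break ties.
prob_k, k = max(candidates)
print(candidates)
print(prob_k, k)
```

Here the winning tuple is the one for j = 0, meaning the best way to end at position 2 is to take "ab" as a single word rather than splitting before "b".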
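Here is a working Ruby implementation, in case anyone else has a use for it. I translated the list comprehension discussed above into what I believe is the appropriate level of idiomatically unreadable Ruby.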
def viterbi(text)
  probabilities = [1.0]
  lasts = [0]

  # Iterate over the end positions in the compound.
  # eg. [h ellodarkness],[he llodarkness],...
  # Python's range(1, len(text) + 1) excludes its end point, so the
  # inclusive Ruby range stops at text.length.
  (1..text.length).each do |i|
    # Array#max compares the [probability, j] pairs element-wise,
    # like Python's max on tuples.
    prob_k, k = ([0, i - maximum_word_length].max...i).map { |j| [probabilities[j] * word_probability(text[j...i]), j] }.max
    probabilities << prob_k
    lasts << k
  end

  # Walk the back-pointers from the end to recover the words.
  words = []
  i = text.length
  while i.positive?
    words << text[lasts[i]...i]
    i = lasts[i]
  end
  words.reverse!
  [words, probabilities.last]
end

def word_probability(word)
  word_counts[word] / word_counts_sum
end

def word_counts_sum
  @word_counts_sum ||= word_counts.values.sum.to_f
end

def maximum_word_length
  @maximum_word_length ||= word_counts.keys.map(&:length).max
end

def word_counts
  return @word_counts if @word_counts
  @word_counts = {"hello" => 12, "darkness" => 6, "friend" => 79, "my" => 1, "old" => 5}
  @word_counts.default = 0
  @word_counts
end

puts "Best split is %s with probability %.6f" % viterbi("hellodarknessmyoldfriend")
# => Best split is ["hello", "darkness", "my", "old", "friend"] with probability 0.000002
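The main annoyance was the different range definitions in Python and Ruby (open versus closed intervals): Python's range excludes its end point, while Ruby's two-dot range includes it. The algorithm is very fast.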
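As a quick illustration of the interval difference, the Python side can be sketched like this, with the Ruby counterparts noted in the comments. The sample string is an assumption chosen only for the demonstration.

```python
text = "hello"
n = len(text)

# Python's range(a, b) excludes b. The Ruby counterpart is the
# three-dot range (a...b); the two-dot range (a..b) includes b.
# So range(1, n + 1) matches Ruby's (1..text.length).
end_positions = list(range(1, n + 1))
print(end_positions)   # [1, 2, 3, 4, 5]

# Slices are half-open as well: text[j:i] in Python matches
# text[j...i] in Ruby.
print(text[1:3])       # el
```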
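It might be advantageous to work with likelihoods (for example log-likelihoods, which turn the products into sums) instead of raw probabilities, since the repeated multiplication can underflow and/or accumulate floating-point error with longer words.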
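One way to do that is to add log-probabilities instead of multiplying probabilities. The following is only a sketch of that idea, not part of the original script: the names WORD_COUNTS, log_word_prob and viterbi_segment_log are hypothetical, the counts are the toy values from the Ruby example, and unknown words are assumed to have log-probability negative infinity (probability 0).

```python
import math

# Hypothetical toy counts, mirroring the Ruby example's word_counts.
WORD_COUNTS = {"hello": 12, "darkness": 6, "friend": 79, "my": 1, "old": 5}
TOTAL = float(sum(WORD_COUNTS.values()))
MAX_WORD_LENGTH = max(map(len, WORD_COUNTS))


def log_word_prob(word):
    """Log-probability of a word; unknown words get -inf (probability 0)."""
    count = WORD_COUNTS.get(word, 0)
    return math.log(count / TOTAL) if count else float("-inf")


def viterbi_segment_log(text):
    """Same recurrence as viterbi_segment, summing logs instead of multiplying."""
    log_probs, lasts = [0.0], [0]   # log(1.0) == 0.0
    for i in range(1, len(text) + 1):
        log_prob_k, k = max(
            (log_probs[j] + log_word_prob(text[j:i]), j)
            for j in range(max(0, i - MAX_WORD_LENGTH), i)
        )
        log_probs.append(log_prob_k)
        lasts.append(k)
    # Follow the back-pointers from the end to recover the words.
    words = []
    i = len(text)
    while i > 0:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, log_probs[-1]


words, log_prob = viterbi_segment_log("hellodarknessmyoldfriend")
print(words, log_prob)
```

Because the logarithm is monotonic, the best split is unchanged; math.exp(log_prob) recovers the plain probability if it is needed for display.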