Logic behind program faulty, program doesn't produce correct output
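This is Python code for computing the type-token ratio (all definitions are given in the code below). I can't get the correct value. I suspect my logic is wrong, but I haven't been able to debug it. I would appreciate any help.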
def type_token_ratio(text):
    """
    (list of str) -> float

    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.

    Return the Type Token Ratio (TTR) for this text. TTR is the number of
    different words divided by the total number of words.

    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n',
    'James Gosling\n']
    >>> type_token_ratio(text)
    0.8888888888888888
    """
    x = 0
    while x < len(text):
        text[x] = text[x].replace('\n', '')
        x = x + 1
    index = 0
    counter = 0
    number_of_words = 0
    words = ' '.join(text)
    words = clean_up(words)
    words = words.replace(',', '')
    lst_of_words = words.split()
    for word1 in lst_of_words:
        while index < len(lst_of_words):
            if word1 == lst_of_words[index]:
                counter = counter + 1
            index = index + 1
    return ((len(lst_of_words) - counter)/len(lst_of_words))
There is a simpler way, using the collections module:
import collections

def type_token_ratio(text):
    """ (list of str) -> float

    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.

    Return the Type Token Ratio (TTR) for this text. TTR is the number of
    different words divided by the total number of words.

    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n',
    'James Gosling\n']
    >>> type_token_ratio(text)
    0.8888888888888888
    """
    words = " ".join(text).split()  # Give a list of all the words
    counts = collections.Counter(words)
    total = sum(counts.values())    # total number of words (avoids shadowing the built-in all)
    unique = len(counts)            # number of distinct words
    return float(unique) / total
Or, as @Yoel pointed out, there is an even simpler way:
def type_token_ratio(text):
    words = " ".join(text).split()  # Give a list of all the words
    return len(set(words))/float(len(words))
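As a quick check, the set-based version can be run on the docstring example. The sketch below is only an illustration: the question's clean_up helper is not shown, so the punctuation stripping and lowercasing here are an assumed stand-in for it, not the original helper.

```python
import string

def type_token_ratio(text):
    """Return (number of distinct words) / (total number of words)."""
    # Assumed normalization, standing in for the question's undefined clean_up:
    # strip surrounding punctuation and lowercase each word.
    words = [w.strip(string.punctuation).lower() for w in " ".join(text).split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return len(set(words)) / float(len(words))

text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'James Gosling\n']
print(type_token_ratio(text))  # 0.8888888888888888 (8 distinct words out of 9)
```

Without the normalization step, 'Peter,' and 'Peter' would count as two different words, so how you clean the text affects the ratio.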
Here is what you probably meant to write (replace your code starting from the for loop).
    init_index = 1
    for word1 in lst_of_words:
        index = init_index  # start searching at the word after word1
        while index < len(lst_of_words):
            if word1 == lst_of_words[index]:
                counter = counter + 1
                break  # count a repeated word at most once per iteration
            index = index + 1
        init_index = init_index + 1
        print(word1)
    print(counter)
    r = (float(len(lst_of_words) - counter)) / len(lst_of_words)
    print('%.2f' % r)
    return r
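=> index = init_index: init_index is the index of the word right after word1, so the search always starts from the next word. In your version, index is never reset, so the inner loop only runs for the first word.
=> break: do not count the same word several times; each iteration adds at most one to the counter.
You are checking whether the rest of the list contains a duplicate of the current word (earlier iterations have already handled the words before it).
Take care not to count multiple occurrences more than once, which is why the break is there. If the same word occurs several times, the next occurrence will be found in a later iteration.
It is not bulletproof; it simply follows your code.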
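The counting idea described above can also be tried in isolation. This is a minimal sketch, not the answerer's code: count_repeats and type_token_ratio_from_words are hypothetical names, and the inner while loop is replaced by a membership test on a slice, which has the same effect as searching from the next word and breaking on the first match.

```python
def count_repeats(lst_of_words):
    """Count the words that occur again later in the list (hypothetical helper)."""
    counter = 0
    for position, word in enumerate(lst_of_words):
        # Look only at the words after the current one.
        if word in lst_of_words[position + 1:]:
            counter += 1  # at most one count per position, like the break
    return counter

def type_token_ratio_from_words(lst_of_words):
    """Return (total - repeats) / total for an already cleaned word list."""
    total = len(lst_of_words)
    return (total - count_repeats(lst_of_words)) / float(total)

words = ['James', 'Fennimore', 'Cooper', 'Peter', 'Paul', 'and', 'Mary',
         'James', 'Gosling']
print(count_repeats(words))                # 1
print(type_token_ratio_from_words(words))  # 0.8888888888888888
```

A word that appears three times contributes two to the counter (one for each occurrence that still has a later copy), so total minus counter equals the number of distinct words.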