Python NLTK :: Intersecting words and sentences
I'm using NLTK, a toolkit for working with corpus texts, and I've defined a function that intersects the user's input with Shakespeare's words.
import random
from nltk.corpus import gutenberg

def shakespeareOutput(userInput):
    user = userInput.split()
    user = random.sample(set(user), 3)
    # here is NLTK's method
    play = gutenberg.sents('shakespeare-hamlet.txt')
    # all lowercase
    hamlet = map(lambda sublist: map(str.lower, sublist), play)
    print hamlet
Returns:
[ ['[', 'the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare', '1599', ']'],
['actus', 'primus', '.'],
['scoena', 'prima', '.'],
['enter', 'barnardo', 'and', 'francisco', 'two', 'centinels', '.'],
['barnardo', '.'],
['who', "'", 's', 'there', '?']...['finis', '.'],
['the', 'tragedie', 'of', 'hamlet', ',', 'prince', 'of', 'denmarke', '.']]
I want to find the sentence that contains the most occurrences of the user's words and return that sentence. I'm trying:
    bestCount = 0
    for sent in hamlet:
        currentCount = len(set(user).intersection(sent))
        if currentCount > bestCount:
            bestCount = currentCount
            answer = ' '.join(sent)
    return ''.join(answer).lower(), bestCount
Calling the function:
shakespeareOutput("The Actus Primus")
Returns:
['The', 'Actus', 'Primus']
None
What am I doing wrong?
Thanks in advance.
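The way you compute currentCount is wrong. The length of a set intersection is the number of distinct elements that match, not the total number of matching occurrences.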
>>> s = [1,1,2,3,3,4]
>>> u = set([1,4])
>>> u.intersection(s)
set([1, 4]) # the len is 2, but the total number of matched elements is 3
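The same difference shows up with word tokens. Here is a minimal Python 3 sketch; the token list and the user words are made-up sample data, not taken from the corpus:

```python
# Python 3 sketch; `sentence` and `user_words` are made-up sample data.
sentence = ['to', 'be', ',', 'or', 'not', 'to', 'be']
user_words = {'to', 'be'}

# Number of distinct user words that appear in the sentence.
distinct_matches = len(user_words.intersection(sentence))

# Total number of occurrences of the user words in the sentence.
total_matches = sum(sentence.count(word) for word in user_words)

print(distinct_matches)  # 2
print(total_matches)     # 4
```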
Use the following code instead.
    bestCount = 0
    for sent in hamlet:
        currentCount = sum([sent.count(i) for i in set(user)])
        if currentCount > bestCount:
            bestCount = currentCount
            answer = ' '.join(sent)
    return answer.lower(), bestCount
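For reference, here is a self-contained Python 3 sketch of the same counting logic that runs without the corpus. The function name `best_matching_sentence` and the `sample_sentences` list are hypothetical stand-ins for your function and for `gutenberg.sents('shakespeare-hamlet.txt')`. The sketch skips the `random.sample` step, and it assumes you also want to lowercase the user's words: the comparison is case sensitive, so 'Actus' would never match the lowercased token 'actus'.

```python
def best_matching_sentence(user_input, sentences):
    """Return (sentence, count) for the sentence with the most
    occurrences of the user's words, or (None, 0) if nothing matches."""
    # Lowercase the user's words so they compare equal to lowercased tokens.
    user_words = set(word.lower() for word in user_input.split())
    best_count = 0
    answer = None  # stays None when no sentence contains any user word
    for sent in sentences:
        tokens = [token.lower() for token in sent]
        # Count every occurrence, not just distinct matches.
        current_count = sum(tokens.count(word) for word in user_words)
        if current_count > best_count:
            best_count = current_count
            answer = ' '.join(tokens)
    return answer, best_count


# Hypothetical stand-in for gutenberg.sents('shakespeare-hamlet.txt').
sample_sentences = [
    ['Actus', 'Primus', '.'],
    ['Scoena', 'Prima', '.'],
    ['Who', "'", 's', 'there', '?'],
]

print(best_matching_sentence("The Actus Primus", sample_sentences))
# ('actus primus .', 2)
```

Initialising `answer` to None also means the function returns cleanly when no sentence matches, rather than referring to a name that was never assigned.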