Calculate the number of words and average word length used in the sentence in Python [on hold]
I am trying to get my code to count the number of words in a sentence (to test it before using it on a .txt file), but it gives me this result:
Mr. Blah has a lot of Sr?
and Mrs. blah does not care.
lol
[[1, 1.0], [1, 1.0], [1, 1.0]]
instead of the following result:
Mr. Blah has a lot of Sr?
and Mrs. blah does not care.
lol
[[7, 2.4], [6, 3.5], [1, 3.0]]
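Below is the code I have worked on so far. It is supposed to count the number of words in a sentence, then count the number of letters used in the sentence, and finally calculate the average number of letters per word in the sentence.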
import csv

terminators = ["?", "!"]  # Characters that always end a sentence other than a period
abrevs = ["Mrs", "Mr", "Dr", "Fr", "Jr", "Sr"]  # Abbreviations that prevent a period from ending a sentence

# Replaced the word_length_list function from 1a. with this new one
def word_length_list(sentence):
    print(sentence)
    return [1]

# Once a sentence is found, this will calculate statistics for it
def collect_statistics(sentence):
    word_lengths = word_length_list(sentence)
    words_in_sentence = len(word_lengths)  # Get word count
    # Average word length
    sum_of_word_lengths = 0
    for length in word_lengths:
        sum_of_word_lengths = sum_of_word_lengths + length
    average_word_length = sum_of_word_lengths / words_in_sentence
    return [words_in_sentence, average_word_length]

# Replaced given text with this to test if it does work for the abbreviations and ellipses
story_text = "Mr. Blah has a lot of Sr? and Mrs. blah does not care. lol"
story_length = len(story_text)
statistics = []
sentence = ""
for i in range(story_length):
    sentence_over = False  # Assume this sentence will continue after the next character
    nextchar = story_text[i]  # Look at the next character in the story
    if nextchar in terminators:
        sentence_over = True  # Change assumption
    # If it is a period, we have some special handling to do
    elif nextchar == ".":
        # A period may be part of an ellipsis or an abbreviation.
        # If the period is followed by another period, it is probably an ellipsis and should stay in the sentence.
        is_part_of_elipse = i + 1 < story_length and story_text[i + 1] == "."
        is_part_of_abbrev = False  # Assume the period does not follow an abbreviation
        for ab in abrevs:  # Then check for an abbreviation
            if sentence.endswith(ab):
                is_part_of_abbrev = True
        if not (is_part_of_elipse or is_part_of_abbrev):  # Neither abbreviation nor ellipsis,
            sentence_over = True  # so the period ends the sentence
    sentence = sentence + nextchar
    # Calculate the sentence statistics
    if sentence_over:
        statistics.append(collect_statistics(sentence))
        # Clear the sentence variable to make room for the next
        sentence = ""
# In case the last sentence was not terminated, add it to the stats
if len(sentence) > 0:
    statistics.append(collect_statistics(sentence))
with open('collect_statistics.csv', 'w') as csvFile:
    writer = csv.writer(csvFile, delimiter=',')
    writer.writerow(statistics)
This function always returns the same result:
def word_length_list(sentence):
    print(sentence)
    return [1]
You may want to review how you count the number of words in a sentence.
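As a minimal sketch of what word_length_list could return instead: the version below assumes words are separated by whitespace and that leading or trailing punctuation should not count as letters. Using string.punctuation for the stripping is an assumption made for illustration, not something taken from the question's code.

```python
import string

def word_length_list(sentence):
    """Return the length of each word, ignoring surrounding punctuation (assumed rule)."""
    lengths = []
    for word in sentence.split():
        stripped = word.strip(string.punctuation)
        if stripped:  # skip tokens that were only punctuation, e.g. "..."
            lengths.append(len(stripped))
    return lengths

print(word_length_list("Mr. Blah has a lot of Sr?"))  # prints [2, 4, 3, 1, 3, 2, 2]
```

Because this returns a list of integers, the collect_statistics function from the question should work with it unchanged, as long as the sentence contains at least one word.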
There are a few things you need to fix.
First, your word_length_list returns [1] and nothing else. Change the function to:
def word_length_list(sentence):
    return sentence.split()
Next, we need to change a few things in collect_statistics to get the result you are looking for. Change the function to:
def collect_statistics(sentence):
    word_lengths = word_length_list(sentence)
    words_in_sentence = len(word_lengths)
    sum_of_word_lengths = 0
    for word in word_lengths:
        sum_of_word_lengths += len(word)
    average_word_length = sum_of_word_lengths / words_in_sentence
    return [words_in_sentence, average_word_length]
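That said, the division can return long decimals, so you will want to account for that (for example by rounding). The numbers I get are also slightly higher than yours because this code still counts the . in Mr. and the ? in Sr? as characters, so your expected 2.4 actually comes out as 2.7.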
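The difference between 2.7 and 2.4 can be checked by hand on the first test sentence. The snippet below is only an illustration; stripping ".?!" from the ends of each token is an assumed way of excluding punctuation, and the variable names are hypothetical.

```python
sentence = "Mr. Blah has a lot of Sr?"
words = sentence.split()  # 7 tokens

# Count every character in each token, punctuation included: 19
raw_total = sum(len(w) for w in words)
# Count only what is left after stripping ".", "?" and "!" from the ends: 17
clean_total = sum(len(w.strip(".?!")) for w in words)

raw_average = raw_total / len(words)      # 19 / 7
clean_average = clean_total / len(words)  # 17 / 7

print(round(raw_average, 1), round(clean_average, 1))  # prints 2.7 2.4
```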
Update:
I wrote my own version, which is about 20 lines shorter and counts correctly, i.e. it does not count the terminators.
Let me know if you have any questions:
terminators = ['?', '!', '.']
abrevs = ['Mrs', 'Mr', 'Dr', 'Fr', 'Jr', 'Sr']
story_text = 'Mr. Blah has a lot of Sr? and Mrs. blah does not care. lol'
word_list = story_text.split()
list_of_sentences = []
list_word_len = []
temp_list = []
stats = []

def end_sentence(sub_word):
    global temp_list
    temp_list.append(sub_word)
    list_of_sentences.append(temp_list)
    temp_list = []

for ndex, word in enumerate(word_list):
    if word[-1:] in terminators:
        sub_word = word.replace(word[-1], " ").split()[-1]
        if word[-1:] == '.':
            if any(abr in word for abr in abrevs):
                temp_list.append(sub_word)
            else:
                end_sentence(sub_word)
        else:
            end_sentence(sub_word)
    else:
        if ndex == len(word_list) - 1:
            end_sentence(word)
        else:
            temp_list.append(word)

for sentence in list_of_sentences:
    for word in sentence:
        list_word_len.append(len(word))
    stats.append([len(sentence), round(sum(list_word_len) / len(sentence), 2)])
    list_word_len = []

print(stats)
Result:
[[7, 2.43], [6, 3.5], [1, 3.0]]
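Note that my first result is 2.43 rather than 2.42 because I rounded to the second decimal place. You don't have to do this. You could instead just keep the first two digits after the decimal point, but I think rounding gives a closer value than truncating.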
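To make the rounding versus truncating distinction concrete, here is a small sketch using the first sentence's average (17 letters over 7 words). Truncating with math.trunc after scaling by 100 is just one assumed way to keep two digits; the variable names are illustrative.

```python
import math

average = 17 / 7  # 2.428571...

rounded = round(average, 2)                  # nearest value with two decimals
truncated = math.trunc(average * 100) / 100  # drop everything past two decimals

print(rounded, truncated)  # prints 2.43 2.42
```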