
Calculate the number of words and the average word length used in a sentence in Python [on hold]

I am trying to get my code to count the number of words in a sentence (to test it before using it on a .txt file), but it gives me this result:

Mr. Blah has a lot of Sr?
 and Mrs. blah does not care.
 lol
[[1, 1.0], [1, 1.0], [1, 1.0]]

instead of the following result:

Mr. Blah has a lot of Sr?
 and Mrs. blah does not care.
 lol
[[7, 2.4], [6, 3.5], [1, 3.0]]

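Below is the code I have worked on so far. It should count the number of words in a sentence, then count the number of letters used in the sentence, and finally compute the average number of letters per word in the sentence.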

terminators = ["?", "!"] #Characters that always end a sentence other than a period
abrevs = ["Mrs", "Mr", "Dr", "Fr", "Jr", "Sr"] #Abbreviations that prevent a period from ending a sentence

#Replaced the word_length_list function from 1a. with this new one
def word_length_list(sentence):
    print(sentence)
    return [1]

#Once a sentence is found, this will calculate statistics for it
def collect_statistics(sentence):
    word_lengths = word_length_list(sentence)
    words_in_sentence = len(word_lengths)  #Get word count

    #Average word length
    sum_of_word_lengths = 0
    for length in word_lengths:
        sum_of_word_lengths = sum_of_word_lengths + length
    average_word_length = sum_of_word_lengths/words_in_sentence;

    return [words_in_sentence, average_word_length]
# Replaced given text with this to test if it does work for the abbreviations and ellipses
story_text = "Mr. Blah has a lot of Sr? and Mrs. blah does not care. lol"

story_length = len(story_text)

statistics = []

sentence = ""

for i in range(story_length):
    sentence_over = False # Assumption that this sentence will continue after the next character
    nextchar = story_text[i] # Look at the next character in the story

    if nextchar in terminators:
        sentence_over = True  #Change assumption.  
                              #If it is a period, we have some special handling to do.
    elif nextchar == ".": #End the sentence after this if-else block.
                          #But if it is a period, we have to deal with ellipsis and abbreviations

        #If the period is followed by another period, probably an ellipsis & want to include in the sentence.
        is_part_of_elipse = i+1 < story_length and story_text[i+1] == "."

        is_part_of_abbrev = False  # Assumption that this sentence will continue after a period, an abbreviation

        for ab in abrevs: #Then check for abbreviation
            if sentence.endswith(ab):
                is_part_of_abbrev = True

        if not (is_part_of_elipse or is_part_of_abbrev): # If not part of abbreviation and not part of ellipsis, 
            sentence_over = True                         # end of sentence by (period)

    sentence = sentence + nextchar;

    # Calculate the sentence statistcs
    if sentence_over:
        statistics.append(collect_statistics(sentence))
        # Clear the sentence variable to make room for the next
        sentence = ""

#Incase the last sentence was not terminated, add it to the stats
if len(sentence)>0:
    statistics.append(collect_statistics(sentence))


import csv  # required for csv.writer; the import was missing from the snippet

with open('collect_statistics.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile, delimiter=',')
    writer.writerow(statistics)

This function always returns the same result:

def word_length_list(sentence):
    print(sentence)
    return [1]

You may want to revisit how you count the words in a sentence.
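One possible replacement, as a minimal sketch: it assumes words are separated by whitespace and returns one length per word, so the existing collect_statistics can stay as it is. Punctuation attached to a word is still counted as a letter here.

```python
def word_length_list(sentence):
    # Split on whitespace and return the length of each word.
    # Trailing punctuation ("Mr.", "Sr?") is counted as part of the word.
    return [len(word) for word in sentence.split()]


print(word_length_list("Mr. Blah has a lot of Sr?"))  # [3, 4, 3, 1, 3, 2, 3]
```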

There are a few things you need to fix.

First, your word_length_list returns [1] and nothing else.

Change the function to:

def word_length_list(sentence):
    return sentence.split()

Next, we need to change a few things in collect_statistics to get the result you are looking for.

Change the function to:

def collect_statistics(sentence):
    word_lengths = word_length_list(sentence)
    words_in_sentence = len(word_lengths)
    sum_of_word_lengths = 0
    for word in word_lengths:
        sum_of_word_lengths += len(word)
    average_word_length = sum_of_word_lengths / words_in_sentence
    return [words_in_sentence, average_word_length]

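That said, the division can return long decimals, so you will want to compensate for that, for example by rounding. I think my numbers come out slightly higher because the code still counts punctuation such as the periods and the question mark (as in "Mr." and "Sr?"), so your expected 2.4 actually comes out as 2.7.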
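To get closer to the expected 2.4, the punctuation can be stripped from each word before it is measured, and the average can be rounded. A minimal sketch, assuming that only leading and trailing punctuation should be ignored; the use of string.punctuation and the choice of one decimal place are assumptions, not part of the original code:

```python
import string


def collect_statistics(sentence):
    # Strip leading/trailing punctuation so "Mr." counts as 2 letters, not 3.
    words = [word.strip(string.punctuation) for word in sentence.split()]
    # Drop tokens that were nothing but punctuation (e.g. "...").
    words = [word for word in words if word]
    if not words:
        return [0, 0.0]  # avoid dividing by zero on an empty sentence
    average_word_length = sum(len(word) for word in words) / len(words)
    return [len(words), round(average_word_length, 1)]


print(collect_statistics("Mr. Blah has a lot of Sr?"))  # [7, 2.4]
```

With the same function, " and Mrs. blah does not care." gives [6, 3.5] and " lol" gives [1, 3.0].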

Update:

I wrote my own version, about 20 lines shorter, that counts correctly, i.e. it does not count the terminators.

Let me know if you have any questions:

terminators = ['?', '!', '.']
abrevs = ['Mrs', 'Mr', 'Dr', 'Fr', 'Jr', 'Sr']
story_text = 'Mr. Blah has a lot of Sr? and Mrs. blah does not care. lol'
word_list = story_text.split()
list_of_sentences = []
list_word_len = []
temp_list = []
stats = []


def end_sentence(sub_word):
    global temp_list
    temp_list.append(sub_word)
    list_of_sentences.append(temp_list)
    temp_list = []


for ndex, word in enumerate(word_list):
    if word[-1:] in terminators:
        # Strip trailing terminators so they are not counted as letters
        sub_word = word.rstrip(''.join(terminators))
        if word[-1:] == '.' and sub_word in abrevs:
            # A period after a known abbreviation does not end the sentence
            temp_list.append(sub_word)
        else:
            end_sentence(sub_word)
    elif ndex == len(word_list) - 1:
        # The last word closes the final sentence even without a terminator
        end_sentence(word)
    else:
        temp_list.append(word)


for sentence in list_of_sentences:
    # Word count and average word length, rounded to two decimals
    word_lengths = [len(word) for word in sentence]
    stats.append([len(sentence), round(sum(word_lengths) / len(sentence), 2)])

print(stats)

Result:

[[7, 2.43], [6, 3.5], [1, 3.0]]

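Note that my first result is 2.43 rather than 2.42 because I round to two decimal places. You do not have to do this. You could simply keep the first two digits after the decimal point, but I think the rounded value is closer to the true average than the truncated one.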
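The difference between the two approaches can be seen on the first sentence, which has 17 letters over 7 words. This is only an illustrative sketch; using math.trunc to cut off the extra digits is one assumed way to truncate, not something the original code does:

```python
import math

average = 17 / 7  # 17 letters over 7 words in the first sentence

rounded = round(average, 2)                  # nearest value with two decimals
truncated = math.trunc(average * 100) / 100  # cut off everything past two decimals

print(rounded)    # 2.43
print(truncated)  # 2.42
```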
