My code is missing some of the lines I'm trying to get out of a file

Question

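The basic task is to write a function get_words_from_file(filename) that returns a list of the lowercase words within the region of interest. They share a regular expression with you, "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", which finds all words that meet this definition. My code works on some tests but fails when the line that marks the region of interest is repeated. Here is my code: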

import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename,'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag=True
                    count += 1
            elif line.startswith("*** END"):
                flag=False
                break       
            elif(flag):
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", 
                                           new_line)
                words.extend(words_on_line)
    
        return words

#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)

The problem is that when the string "*** START OF" is repeated inside the region of interest, that line is not included.

The test code should result in:
bee.txt loaded ok.
16 valid words found.
Valid word list:
yes
really
this
time
start
of
synthetic
test
case
end
synthetic
test
case
i'm
in
too

But I get:

bee.txt loaded ok.
11 valid words found.
Valid word list:
yes
really
this
time
end
synthetic
test
case
i'm
in
too

Any help would be great! Attached is a screenshot of the file.

Answer 1

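The specific problem with your code is the if .. elif .. elif statement: you ignore every line that looks like it marks the start or end of a block, even when that line is inside the test block. A line that starts with "*** START OF" always takes the first branch, so it never reaches the branch that collects words, even when the flag is already set.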

You want a function like this:

import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            # A start marker only counts while we are outside the block
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            # An end marker only counts while we are inside the block
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            # Any other line inside the block, including a repeated start marker
            elif in_block:
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)

        return words

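This assumes you are actually looking for the whole line as the marker. Of course, you can still use .startswith() if a prefix is what you actually accept as the start or end of a block, as long as it is unambiguous enough.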
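For illustration, here is a minimal sketch of the .startswith() variant. The helper name get_words_from_lines and the sample lines are hypothetical, since the real contents of bee.txt are only known from the expected output; the sample is simply one input that would produce the same 16 words. The important part is that each marker test is guarded by in_block, so a repeated start line inside the block falls through to the word-collecting branch.

```python
import re

# The pattern shared in the question, compiled once.
WORD_PATTERN = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")

def get_words_from_lines(lines):
    """Hypothetical helper: same logic as above, but on any iterable of
    lines and with prefix matching instead of whole-line matching."""
    in_block = False
    words = []
    for line in lines:
        if not in_block and line.startswith("*** START OF"):
            in_block = True   # first start marker opens the block
        elif in_block and line.startswith("*** END"):
            break             # first end marker inside the block closes it
        elif in_block:
            # Includes a repeated start marker, which is ordinary text here
            words.extend(WORD_PATTERN.findall(line.lower()))
    return words

# Assumed sample input, NOT the real bee.txt.
sample_lines = [
    "Preamble that is ignored\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    "Yes, really this time!\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    '"*** END SYNTHETIC TEST CASE ***"\n',
    "I'm in too.\n",
    "*** END SYNTHETIC TEST CASE ***\n",
    "This line is ignored.\n",
]

words = get_words_from_lines(sample_lines)
print("{} valid words found.".format(len(words)))
```

Note that with prefix matching, any line inside the block that begins with "*** END" ends the block, which is why the quoted line in the sample is still collected but an unquoted one would not be. Whether that is acceptable depends on what the real file looks like.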

Your idea of using a flag is fine, though it is always a good idea to name a flag after whatever it represents.

Solution 1
0 2022-06-10 05:55:08