
My code is missing some of the lines I'm trying to get out of a file

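The basic task is to write a function get_words_from_file(filename) that returns a list of lower-case words within the region of interest. We are given a regular expression, "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", which finds all words that fit this definition. My code works on some of the tests, but fails when the line that marks the region of interest is repeated. Here is my code: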

import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of 
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename,'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag=True
                    count += 1
            elif line.startswith("*** END"):
                flag=False
                break       
            elif(flag):
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", 
                                           new_line)
                words.extend(words_on_line)
    
        return words

#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)

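The problem is that the string "*** START OF" is repeated, and when it appears inside the region of interest it is not included in the output.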

The test code should result in:
bee.txt loaded ok.↩
16 valid words found.↩
Valid word list:↩
yes↩
really↩
this↩
time↩
start↩
of↩
synthetic↩
test↩
case↩
end↩
synthetic↩
test↩
case↩
i'm↩
in↩
too

But I get:

bee.txt loaded ok.↩
11 valid words found.↩
Valid word list:↩
yes↩
really↩
this↩
time↩
end↩
synthetic↩
test↩
case↩
i'm↩
in↩
too

Any help would be great! A screenshot of the file is attached.

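The specific problem with your code is the if .. elif .. elif statement: you skip every line that looks like a block start or end marker, even when that line is inside the test block.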
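To make that concrete, here is a small sketch that runs the same branching over a list of in-memory lines. The helper name words_with_original_branches and the sample lines are made up for illustration; they are not the contents of bee.txt.

```python
import re

WORD_RE = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"


def words_with_original_branches(lines):
    """Apply the question's if/elif/elif branching to a list of lines.

    The `while count < 1` loop from the question is reduced to a plain
    assignment; the flag ends up in the same state either way.
    """
    flag = False
    words = []
    for line in lines:
        if line.startswith("*** START OF"):
            # Taken for EVERY start-like line, including one inside the
            # block, so that line's words are never collected.
            flag = True
        elif line.startswith("*** END"):
            break
        elif flag:
            words.extend(re.findall(WORD_RE, line.lower()))
    return words


# Hypothetical sample input, assumed for this sketch only.
sample = [
    "ignored preamble\n",
    "*** START OF A SYNTHETIC TEST CASE ***\n",
    "Yes, really\n",
    "*** START OF something inside the block\n",
    "still inside\n",
    "*** END TEST CASE ***\n",
]

print(words_with_original_branches(sample))
# ['yes', 'really', 'still', 'inside']
```

The fourth sample line sits inside the block, yet none of its words appear in the result, because the first branch matches before the elif(flag) branch is ever considered.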

You want a function like this:

import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            # Only treat the start marker as a marker while outside the block
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            # Only treat the end marker as a marker while inside the block
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            # Any other line inside the block contributes its words
            elif in_block:
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)

        return words

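This assumes you are actually looking for the whole line as the marker. Of course, if you do accept a prefix as the start or end of a block, you can still use .startswith(), as long as the prefix is unambiguous enough.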
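A minimal sketch of that .startswith() variant, assuming the markers are identified by prefix only. The function name get_words_between_prefixes, its parameters, and the sample lines are hypothetical, and it takes a list of lines rather than a filename so it can be tried without a file.

```python
import re

WORD_RE = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"


def get_words_between_prefixes(lines, start_prefix="*** START OF",
                               end_prefix="*** END"):
    """Collect words between the first start-prefixed line and the next
    end-prefixed line. The default prefixes are assumptions."""
    in_block = False
    words = []
    for line in lines:
        if not in_block and line.startswith(start_prefix):
            in_block = True
        elif in_block and line.startswith(end_prefix):
            break
        elif in_block:
            # A start-like line inside the block lands here and is kept.
            words.extend(re.findall(WORD_RE, line.lower()))
    return words


# Hypothetical sample input, assumed for this sketch only.
sample = [
    "preamble\n",
    "*** START OF A SYNTHETIC TEST CASE ***\n",
    "Yes, really\n",
    "*** START OF something\n",
    "I'm in too\n",
    "*** END TEST CASE ***\n",
    "after the block\n",
]

print(get_words_between_prefixes(sample))
# ['yes', 'really', 'start', 'of', 'something', "i'm", 'in', 'too']
```

Note the trade-off: with a prefix as the end marker, any line inside the block that happens to begin with that prefix ends the block early, which is why the prefix has to be unambiguous.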

Your idea of using a flag is fine, although it is always a good idea to name a flag after whatever it represents.
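As a side note, it can help to check what the regular expression from the question picks out of a single line before debugging the block logic. The helper name find_words and the sample sentence below are made up for illustration.

```python
import re

WORD_RE = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"


def find_words(line):
    """Return the lower-case words the question's pattern finds in a line."""
    return re.findall(WORD_RE, line.lower())


print(find_words("Well-known: I'm the dogs' friend, 42 times!"))
# ['well-known', "i'm", 'the', "dogs'", 'friend', 'times']
```

The first alternative catches words with one internal hyphen or apostrophe, the second allows a trailing apostrophe, and digits and other punctuation are skipped.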
