My code is missing some lines I'm trying to extract from a file

Question

The basic task is to write a function, get_words_from_file(filename), that returns a list of lower case words that are within the region of interest. They share with you a regular expression: "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds all words that meet this definition. My code works well on some of the tests but fails when the line that indicates the region of interest is repeated. Here is my code:

import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of 
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename,'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag=True
                    count += 1
            elif line.startswith("*** END"):
                flag=False
                break       
            elif(flag):
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", 
                                           new_line)
                words.extend(words_on_line)
    
        return words

#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)

The issue is that the string "*** START OF" is repeated, and the line isn't included when it appears inside the region of interest.

The test code should result in:
bee.txt loaded ok.
16 valid words found.
Valid word list:
yes
really
this
time
start
of
synthetic
test
case
end
synthetic
test
case
i'm
in
too

But I'm getting:

bee.txt loaded ok.
11 valid words found.
Valid word list:
yes
really
this
time
end
synthetic
test
case
i'm
in
too

Any help would be great! Attached is a screenshot of the file.

Answer 1

The specific problem in your code is the if .. elif .. elif statement: you're ignoring every line that looks like the line that signals the start or end of a block, even when it's inside the test block.
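To make that concrete, here is a minimal sketch that mirrors the branch order of the question's loop on a few made-up lines. The sample text and the helper name words_original_logic are assumptions for illustration; the real bee.txt is not reproduced here.

```python
import re

WORD_RE = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"

# Hypothetical sample lines (an assumption, not the actual contents of bee.txt).
sample = [
    "ignored preamble\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    "Yes, really this time\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",  # repeated marker inside the region
    "I'm in too\n",
    "*** END ***\n",
    "ignored trailer\n",
]

def words_original_logic(lines):
    """Mirror the if/elif/elif order used in the question."""
    in_block = False
    words = []
    for line in lines:
        if line.startswith("*** START OF"):
            # Every start-like line lands here, so a repeat inside the
            # region never reaches the word-extraction branch below.
            in_block = True
        elif line.startswith("*** END"):
            break
        elif in_block:
            words.extend(re.findall(WORD_RE, line.lower()))
    return words

print(words_original_logic(sample))
```

The words "start of synthetic test case" from the repeated marker line never appear in the result, which matches the symptom described in the question.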

You want something like this for your function:

import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest: every word in the text file, but not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            # Only a start marker seen outside the region opens it;
            # a repeat inside the region falls through and is read as text.
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            # An end marker only counts once the region is open.
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            elif in_block:
                words_on_line = re.findall(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)

        return words

This assumes you are looking for the whole line as a marker. Of course you can still use .startswith() if a prefix is what you accept as the start or end of the block, as long as it's sufficiently unambiguous.
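A prefix-based variant might look like the sketch below. It takes any iterable of lines so it can be tried without a file. The function name words_in_region, its start and end parameters, and the sample lines are hypothetical choices for illustration, not part of the original task.

```python
import re

WORD_RE = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")

def words_in_region(lines, start="*** START OF", end="*** END"):
    """Collect lower case words between the first start prefix and the next end prefix."""
    in_block = False
    words = []
    for line in lines:
        if not in_block and line.startswith(start):
            in_block = True      # the first start marker opens the region
        elif in_block and line.startswith(end):
            break                # the first end marker inside the region closes it
        elif in_block:
            # Includes repeated start markers, which are now ordinary text.
            words.extend(WORD_RE.findall(line.lower()))
    return words

# Hypothetical sample lines (an assumption, not the actual contents of bee.txt).
sample = [
    "ignored preamble\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    "Yes, really this time\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    "I'm in too\n",
    "*** END ***\n",
]

print(words_in_region(sample))
```

Note that with prefix matching, any line inside the region that begins with the end prefix closes the region, so the prefix has to be long enough to exclude lines that should count as text.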

Your idea of using a flag is fine, although naming a flag after what it represents is always a good idea.

Answer 1
0 2022-06-10 05:55:08