My code is missing some of the lines I'm trying to get out of a file
The basic task is to write a function, get_words_from_file(filename), that returns a list of lower case words that are within the region of interest. They share with you a regular expression, "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds all words that meet this definition. My code works well on some of the tests but fails when the line that indicates the region of interest is repeated. Here is my code:
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename,'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag=True
                    count += 1
            elif line.startswith("*** END"):
                flag=False
                break
            elif(flag):
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+",
                                           new_line)
                words.extend(words_on_line)
        return words

#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)

The issue is that the string "*** START OF" is repeated, and that line isn't included in the output when it appears inside the region of interest.
The test code should result in:
bee.txt loaded ok.
16 valid words found.
Valid word list:
yes
really
this
time
start
of
synthetic
test
case
end
synthetic
test
case
i'm
in
too
But I'm getting:
bee.txt loaded ok.
11 valid words found.
Valid word list:
yes
really
this
time
end
synthetic
test
case
i'm
in
too
Any help would be great! Attached is a screenshot of the file.
The specific problem with your code is the if .. elif .. elif statement: you're ignoring every line that looks like the line that signals the start or end of a block, even when it's inside the test block.
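To make that concrete, here is a small sketch that labels each line the way the question's branch order would treat it. The function name classify_original and the sample lines are invented for illustration (the real contents of bee.txt are not shown in the post), and the count loop is collapsed into a plain assignment, which should behave the same since the flag is never reset before the loop ends.

```python
def classify_original(lines):
    """Label each line according to the branch order used in the question."""
    flag = False
    labels = []
    for line in lines:
        if line.startswith("*** START OF"):
            # This branch wins for every start-like line, even inside the
            # region, so such a line never reaches the collecting branch.
            flag = True
            labels.append("start marker (skipped)")
        elif line.startswith("*** END"):
            labels.append("end marker (stop)")
            break
        elif flag:
            labels.append("collected")
        else:
            labels.append("ignored")
    return labels


# Hypothetical input, not the actual test file.
sample_lines = [
    "Some preamble\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    "Yes, really this time\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    "I'm in too\n",
    "*** END OF SYNTHETIC TEST CASE ***\n",
]

for text, label in zip(sample_lines, classify_original(sample_lines)):
    print(f"{label:24} {text.rstrip()}")
```

The second start-like line is labelled as a skipped marker rather than collected, which matches the missing "start of synthetic test case" words in the question's output.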
You wanted something like this for your function:
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            # Only treat the start marker as a marker while outside the block
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            # Only treat the end marker as a marker while inside the block
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            elif in_block:
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)
        return words

This assumes you are actually looking for the whole line as a marker. Of course, you can still use .startswith() if you accept that prefix as the start or end of the block, as long as it's sufficiently unambiguous.
Your idea of using a flag is fine, although naming a flag after what it represents is always a good idea.
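If prefix matching is acceptable, the .startswith() variant might look roughly like the sketch below. It works on any iterable of lines so it can be tried without a file; the name words_in_region, the default prefixes, and the sample lines are assumptions for illustration rather than the actual markers or contents of bee.txt.

```python
import re

# Same pattern as in the question, compiled once.
WORD_PATTERN = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")


def words_in_region(lines, start_prefix="*** START OF", end_prefix="*** END"):
    """Collect lower case words between the first start line and the next end line."""
    in_block = False
    words = []
    for line in lines:
        if not in_block and line.startswith(start_prefix):
            # Only the first start line opens the region.
            in_block = True
        elif in_block and line.startswith(end_prefix):
            break
        elif in_block:
            # Repeated start-like lines end up here and are read as text.
            words.extend(WORD_PATTERN.findall(line.lower()))
    return words


# Hypothetical input, not the actual test file.
sample_lines = [
    "Preamble that is ignored\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    "Yes, really this time!\n",
    "*** START OF SYNTHETIC TEST CASE ***\n",
    "I'm in too.\n",
    "*** END OF SYNTHETIC TEST CASE ***\n",
    "Trailing text that is ignored\n",
]

print(words_in_region(sample_lines))
```

With a file, the same function can be called as words_in_region(file) inside the with block, since an open text file iterates line by line. Note that any line inside the region that begins with the end prefix still closes the region, so the prefix has to be one that cannot occur in the body text.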