My code is missing some of the lines I'm trying to get out of a file
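The basic task is to write a function get_words_from_file(filename) that returns a list of the lowercase words within the region of interest. They share a regular expression with you, "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", which finds all words that fit this definition. My code works on some of the tests, but fails when the line that marks the region of interest is repeated. Here is my code: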
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename,'r', encoding='utf-8') as file:
        flag = False
        words = []
        count = 0
        for line in file:
            if line.startswith("*** START OF"):
                while count < 1:
                    flag=True
                    count += 1
            elif line.startswith("*** END"):
                flag=False
                break
            elif(flag):
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+",
                                           new_line)
                words.extend(words_on_line)
    return words

#test code:
filename = "bee.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)
The problem is that the string "*** START OF" is repeated, and when it appears inside the region of interest it is not included.
The test code should result in:
bee.txt loaded ok.
16 valid words found.
Valid word list:
yes
really
this
time
start
of
synthetic
test
case
end
synthetic
test
case
i'm
in
too
But I get:
bee.txt loaded ok.
11 valid words found.
Valid word list:
yes
really
this
time
end
synthetic
test
case
i'm
in
too
Any help would be great! A screenshot of the file is attached.
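The specific problem with your code is the if .. elif .. elif statement: you ignore every line that looks like a marker for the start or end of a block, even when that line is inside the test block.
You want a function like this: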
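To make that concrete, here is a small sketch that replays the branching from the question and records which branch handles each line. The helper name trace_branches and the sample lines are invented for illustration; they are not the contents of bee.txt, which is only shown as a screenshot.

```python
def trace_branches(lines):
    """Replay the question's if/elif chain and record which branch handles each line."""
    flag = False
    count = 0
    trace = []
    for line in lines:
        if line.startswith("*** START OF"):
            # This branch is taken for EVERY start-like line, even when the
            # flag is already set, so the line's words are never collected.
            while count < 1:
                flag = True
                count += 1
            trace.append((line, "start branch (skipped)"))
        elif line.startswith("*** END"):
            trace.append((line, "end branch (stop)"))
            break
        elif flag:
            trace.append((line, "collected"))
        else:
            trace.append((line, "ignored (outside block)"))
    return trace


# Hypothetical sample lines, made up for this sketch.
sample_lines = [
    "Preamble",
    "*** START OF SYNTHETIC TEST CASE ***",
    "Yes, really this time",
    "*** START OF SYNTHETIC TEST CASE ***",
    "I'm in too",
    "*** END SYNTHETIC TEST CASE ***",
]

trace = trace_branches(sample_lines)
for line, branch in trace:
    print(f"{branch:26} {line}")
```

The second start-like line lands in the first branch again instead of the collecting branch, which is why its words go missing.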
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are with the region of
    interest, every word in the text file, but, not any of the punctuation."""
    with open(filename, 'r', encoding='utf-8') as file:
        in_block = False
        words = []
        for line in file:
            # Only treat the start marker as a marker while outside the block.
            if not in_block and line == "*** START OF A SYNTHETIC TEST CASE ***\n":
                in_block = True
            # Only treat the end marker as a marker while inside the block.
            elif in_block and line == "*** END TEST CASE ***\n":
                break
            # Every other line inside the block, including a repeated start
            # marker, is scanned for words.
            elif in_block:
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line.lower())
                words.extend(words_on_line)
    return words
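This assumes that you are actually looking for the whole line as a marker. Of course, if you accept any line that merely begins with the marker text as the start or end of a block, you can still use .startswith(), as long as the prefix is unambiguous enough.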
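As a rough sketch of the .startswith() variant, the version below takes the marker prefixes as parameters and tries it on a throwaway file. The function name get_words_between_markers, the default prefixes (borrowed from the question's code), and the sample text are all assumptions for illustration; the real bee.txt may differ.

```python
import os
import re
import tempfile

# The regular expression supplied with the task.
WORD_PATTERN = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")


def get_words_between_markers(filename,
                              start_prefix="*** START OF",
                              end_prefix="*** END"):
    """Collect lowercase words between the first start marker and the next end marker."""
    in_block = False
    words = []
    with open(filename, "r", encoding="utf-8") as file:
        for line in file:
            if not in_block and line.startswith(start_prefix):
                in_block = True  # first start marker opens the block
            elif in_block and line.startswith(end_prefix):
                break  # first end marker inside the block closes it
            elif in_block:
                # Includes start-like lines repeated inside the block.
                words.extend(WORD_PATTERN.findall(line.lower()))
    return words


# Hypothetical sample text, invented for this sketch.
SAMPLE_TEXT = "\n".join([
    "Not in the block",
    "*** START OF SYNTHETIC TEST CASE ***",
    "Yes, really this time",
    "*** START OF SYNTHETIC TEST CASE ***",
    "END SYNTHETIC TEST CASE",
    "I'm in too!",
    "*** END SYNTHETIC TEST CASE ***",
    "After the block",
]) + "\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as sample_file:
        sample_file.write(SAMPLE_TEXT)
    words = get_words_between_markers(sample_path)

print("{} valid words found.".format(len(words)))
for word in words:
    print(word)
```

Note that with a prefix check, any line inside the block that begins with the end prefix closes the block, so the prefixes need to be specific enough for your files.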
Your idea of using a flag is fine, although it is always a good idea to name the flag after what it represents.