Repeatedly extract a line between two delimiters in a text file, Python
I have a text file in the following format:
DELIMITER1
extract me
extract me
extract me
DELIMITER2
I want to extract every block of extract me lines that sits between DELIMITER1 and DELIMITER2 in the .txt file.
This is my current, non-working code:
import re

def GetTheSentences(file):
    fileContents = open(file)
    start_rx = re.compile('DELIMITER')
    end_rx = re.compile('DELIMITER2')

    line_iterator = iter(fileContents)
    start = False
    for line in line_iterator:
        if re.findall(start_rx, line):
            start = True
            break
    while start:
        next_line = next(line_iterator)
        if re.findall(end_rx, next_line):
            break
        print next_line
        continue
    line_iterator.next()
Any ideas?
You can simplify this to a single regular expression by using re.S (the DOTALL flag).
import re

def GetTheSentences(infile):
    with open(infile) as fp:
        # re.S (DOTALL) lets '.' match newlines, so one match can span several lines
        for result in re.findall(r'DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
            print(result)

# extract me
# extract me
# extract me
This also makes use of the non-greedy operator .*?, so multiple non-overlapping blocks of DELIMITER1-DELIMITER2 pairs will all be found.
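To see why the non-greedy quantifier matters, here is a small sketch that runs on an in-memory string instead of a file. The sample text and the extract_blocks helper are made up for illustration; they are not part of the answer above.

```python
import re

# Hypothetical sample: two delimited blocks with one stray line between them.
SAMPLE = (
    "DELIMITER1\n"
    "first a\n"
    "first b\n"
    "DELIMITER2\n"
    "ignored\n"
    "DELIMITER1\n"
    "second a\n"
    "DELIMITER2\n"
)

def extract_blocks(text, pattern=r"DELIMITER1(.*?)DELIMITER2"):
    """Return one list of lines per match of the pattern's capture group."""
    return [block.strip().splitlines()
            for block in re.findall(pattern, text, re.S)]

# Non-greedy: each DELIMITER1 is paired with the nearest following DELIMITER2.
lazy = extract_blocks(SAMPLE)

# Greedy: a single match runs from the first DELIMITER1 to the last DELIMITER2,
# swallowing the inner delimiters and the stray line along the way.
greedy = extract_blocks(SAMPLE, r"DELIMITER1(.*)DELIMITER2")

print(lazy)
print(len(greedy))
```

With the greedy pattern the two blocks collapse into one match, which is usually not what you want when the file contains several delimiter pairs.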
This should do what you want:
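import re

def GetTheSentences(file):
    # 'DELIMITER' alone would also match DELIMITER2, so spell out the start marker
    start_rx = re.compile('DELIMITER1')
    end_rx = re.compile('DELIMITER2')

    start = False
    output = []
    # Text mode, so the str patterns can be matched against each line
    with open(file) as datafile:
        for line in datafile:
            if start_rx.match(line):
                start = True     # entering a block; the delimiter line is skipped
            elif end_rx.match(line):
                start = False    # leaving the block
            elif start:
                output.append(line)
    return output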
Your previous version looks like it was meant to be an iterator function. Do you want your output returned one item at a time? That is slightly different.
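If one item at a time is what you are after, a generator version might look like the following sketch. The name iter_sentences is hypothetical, and the sketch assumes each delimiter sits alone on its own line, as in the sample from the question. It accepts any iterable of lines, so an open file object works as well as a list.

```python
def iter_sentences(lines, start="DELIMITER1", end="DELIMITER2"):
    """Lazily yield each line found between a start and an end delimiter line."""
    inside = False
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped == start:
            inside = True       # block begins; do not yield the delimiter
        elif stripped == end:
            inside = False      # block ends
        elif inside:
            yield stripped

# Hypothetical in-memory stand-in for the lines of a file.
sample = [
    "DELIMITER1\n", "extract me\n", "DELIMITER2\n",
    "skip me\n",
    "DELIMITER1\n", "extract me too\n", "DELIMITER2\n",
]

for sentence in iter_sentences(sample):
    print(sentence)
```

To read from disk, pass the file object directly, for example inside a with open(path) as fp: block, and nothing beyond the current line is held in memory.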
If the delimiters are within a single line:
def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ','  # just example delimiters
        for line in file_contents:
            i1, i2 = line.find(d1), line.find(d2)
            if -1 < i1 < i2:
                yield line[i1+1:i2]

sentences = list(get_sentences('path/to/my/file'))
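The slice line[i1+1:i2] fits single-character delimiters, and line.find(d2) searches from the start of the line, so a d2 that appears before d1 causes the line to be skipped. For longer delimiters, one possible variation is sketched below; the helper name between and the sample strings are assumptions made for illustration.

```python
def between(line, d1, d2):
    """Return the text between the first d1 and the next d2, or None if absent."""
    i1 = line.find(d1)
    if i1 == -1:
        return None
    begin = i1 + len(d1)        # skip the whole start delimiter, not just 1 char
    i2 = line.find(d2, begin)   # only look for d2 after the end of d1
    if i2 == -1:
        return None
    return line[begin:i2]

print(between("id=<<42>> ok", "<<", ">>"))
print(between(", a. b, c", ".", ","))
```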
If they are on their own lines:
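def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ','  # just example delimiters
        results = None  # None means we are currently outside a block
        for line in file_contents:
            if d1 in line:
                results = []  # start collecting a new block
            elif d2 in line:
                if results is not None:
                    yield results
                results = None  # lines after the block must not be collected
            elif results is not None:
                results.append(line)

sentences = list(get_sentences('path/to/my/file'))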
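This is a good job for list comprehensions; no regex is needed. The first list comp scrubs the typical \n from the list of text lines found when opening a txt file. The second list comp just uses the in operator to identify the sequence patterns to filter out.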
def extract_lines(file):
    # 'with' makes sure the file is closed once the lines have been read
    with open(file, 'r') as f:
        scrubbed = [x.strip('\n') for x in f]
    return [x for x in scrubbed if x not in ('DELIMITER1', 'DELIMITER2')]
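One caveat: the filter above keeps every line that is not a delimiter, so lines outside a DELIMITER1/DELIMITER2 pair survive as well. If the file can contain such lines and you still want to avoid regex, a rough sketch using str.split and list comprehensions could look like this. The function name lines_between and the sample text are hypothetical.

```python
def lines_between(text, start="DELIMITER1", end="DELIMITER2"):
    """Collect the non-empty lines that sit between each start/end pair."""
    # Everything after each start marker, cut off at the first end marker.
    # Parts without an end marker (an unterminated block) are ignored.
    chunks = [part.split(end, 1)[0]
              for part in text.split(start)[1:]
              if end in part]
    # Flatten the chunks into lines; note that blank lines are dropped too.
    return [line for chunk in chunks for line in chunk.splitlines() if line]

# Hypothetical sample with lines before, between, and after the blocks.
sample = (
    "before\n"
    "DELIMITER1\na\nb\nDELIMITER2\n"
    "between\n"
    "DELIMITER1\nc\nDELIMITER2\n"
    "after\n"
)
print(lines_between(sample))
```

This reads the whole text into memory first, for example via fp.read(), so it suits small to medium files.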