Selecting all lines/strings that fall between pattern in text file
Given a text file that looks like this when loaded:
>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp
How can I extract all lines that fall between lines that contain '>', including the last lines, where there is no ending '>'?
For example, the result should look like this:
result = ['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN','SSSSSSSSSSS','pppppppppppppppppppppppppppppppppppppppppp']
I'm realizing what I did won't work because it's looking for text between each new line and '>'. Running this just gives me empty strings.
import re

def findtext(inputtextfile, start, end):
    try:
        pattern = rf'{start}(.*?){end}'
        return re.findall(pattern, inputtextfile)
    except ValueError:
        return -1

result = findtext(inputtextfile, "\n", ">")
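A small check, not part of the original post, illustrates why this returns only empty strings. By default . does not match a newline, so the lazy group can only match when > immediately follows a newline. With re.DOTALL the group can span lines, but the last record is still missed because no > follows it. The string sample below is a shortened stand-in for the file contents.

```python
import re

# Shortened stand-in for the file contents (assumption for illustration)
sample = ">a x\nNN\nNN\n>b y\nSS\n>c z\npp\npp"

# Without DOTALL, '.' cannot cross a newline, so only empty groups match
without_dotall = re.findall(r"\n(.*?)>", sample)

# With DOTALL the group spans lines, but the final record has no closing '>'
with_dotall = re.findall(r"\n(.*?)>", sample, flags=re.DOTALL)

print(without_dotall)
print(with_dotall)
```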
Maybe try splitting on rows that start with >. That way you get back a list of the data in between, which you can join after removing the \n characters.
s = """>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp"""
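import re

def findtext(inputtextfile, start, end):
    try:
        # Split on each header line (from start through end), drop empty
        # pieces, then remove the newlines inside each remaining chunk.
        parts = re.split(f'{start}.*{end}', inputtextfile)
        return [x.replace('\n', '') for x in parts if x]
    except ValueError:
        return -1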
Trying with your provided case:
findtext(s, '>','\n')
Output
['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
'SSSSSSSSSSS',
'pppppppppppppppppppppppppppppppppppppppppp']
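One caveat with interpolating start and end directly into the pattern is that any regex metacharacters in them are interpreted as regex syntax. A minimal sketch of a variant that escapes them with re.escape follows. The name findtext_escaped and the '+' delimiter in the sample are illustrative assumptions, not part of the original answer.

```python
import re

def findtext_escaped(text, start, end):
    # re.escape makes start/end match literally, even if they contain
    # characters such as '+' or '|' that are special in a regex.
    pattern = f"{re.escape(start)}.*{re.escape(end)}"
    return [part.replace("\n", "") for part in re.split(pattern, text) if part]

# Hypothetical input whose header lines start with '+' instead of '>'
sample = "+rice1\nNNN\nNNN\n+peanuts2\nSSS\n"
print(findtext_escaped(sample, "+", "\n"))
```

Without the escaping, a start value of '+' would produce the pattern '+.*\n', which re rejects because the leading '+' has nothing to repeat.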
One option could be using re.split on the lines that start with > and then removing all the whitespace chars from the parts.
import re
pattern = r"^>.*"
s = (">rice1 1ALBRGHAER\n"
"NNNNNNNNNNNNNNNNNNNNN\n"
"NNNNNNNNNNNNNNNNNNNNN\n"
">peanuts2 2LAEKaq\n"
"SSSSSSSSSSS\n"
">OIL3 3hkasUGSV\n"
"ppppppppppppppppppppp\n"
"ppppppppppppppppppppp")
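# flags is passed by keyword (positional maxsplit/flags are deprecated since Python 3.13);
# the loop variable is named part so it does not shadow s.
res = [re.sub(r"\s+", "", part) for part in re.split(pattern, s, flags=re.M) if part]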
print(res)
Output (formatted a bit)
[
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
'SSSSSSSSSSS',
'pppppppppppppppppppppppppppppppppppppppppp'
]
See a Python demo.
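If a regular expression is not required, the same grouping can be sketched with a plain loop over the lines. The function name join_records and the shortened sample are assumptions for illustration. Unlike the re.split versions above, this sketch keeps an empty string for a header that has no data lines after it.

```python
def join_records(lines):
    """Concatenate the lines that follow each '>' header into one string per record."""
    records = []
    current = None  # stays None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if current is not None:
                records.append("".join(current))
            current = []
        elif line and current is not None:
            current.append(line)
    # The last record has no following header, so flush it here
    if current is not None:
        records.append("".join(current))
    return records

# Shortened stand-in for the file contents (assumption for illustration)
sample = ">a x\nNN\nNN\n>b y\nSS\n>c z\npp\npp"
print(join_records(sample.splitlines()))
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly instead of a list from splitlines().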