Select all lines/strings between patterns in a text file

Question

Given a text file that looks like this when loaded:

>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp

How can I extract all the lines that fall between lines containing '>', including the final block, which has no closing '>' line after it?

For example, the result should look like this:

result = ['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN','SSSSSSSSSSS','pppppppppppppppppppppppppppppppppppppppppp']

I realize that what I did won't work, because it looks for text between each newline and '>'. Running it just gives me empty strings.

def findtext(inputtextfile, start, end):
    try:
       pattern=rf'{start}(.*?){end}'
       return re.findall(pattern, inputtextfile)
    except ValueError:
       return -1
result = findtext(inputtextfile,"\n", ">")

Answer 1

Try splitting on the lines that start with >. That gives you back a list of the data blocks in between, which you can then join after removing the \n characters.

s = """>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp"""

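import re

def findtext(inputtextfile, start, end):
    # Split on every header line (from `start` up to and including `end`),
    # drop the empty pieces, and remove the newlines inside each block.
    parts = re.split(f'{start}.*{end}', inputtextfile)
    return [x.replace('\n', '') for x in parts if x]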

Trying it with your provided case:

findtext(s, '>','\n')

Output

['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
 'SSSSSSSSSSS',
 'pppppppppppppppppppppppppppppppppppppppppp']
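
One caveat with the function above: start and end are placed into the pattern as-is, so a delimiter containing regex metacharacters (for example | or .) would change what the pattern matches. Below is a minimal sketch of a variant that escapes both delimiters with re.escape. The name findtext_escaped and the shortened sample string are assumptions made for illustration, not part of the original answer.

```python
import re

def findtext_escaped(text, start, end):
    # Hypothetical variant: escape the delimiters so they are matched literally.
    pattern = f'{re.escape(start)}.*{re.escape(end)}'
    # Split on header lines, skip empty pieces, remove newlines inside blocks.
    return [part.replace('\n', '') for part in re.split(pattern, text) if part]

# Shortened sample data with the same shape as the question's file.
sample = (">rice1 1ALBRGHAER\nNNN\nNNN\n"
          ">peanuts2 2LAEKaq\nSSS\n"
          ">OIL3 3hkasUGSV\nppp\nppp")

print(findtext_escaped(sample, '>', '\n'))
```

For the delimiters used in the question ('>' and '\n') this should behave the same as the unescaped version; the difference only matters for delimiters that carry special meaning in a regular expression.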

Answer 2

One option is to use re.split on the lines that start with > and then remove all whitespace characters from the resulting parts.

import re

pattern = r"^>.*"

s = (">rice1 1ALBRGHAER\n"
            "NNNNNNNNNNNNNNNNNNNNN\n"
            "NNNNNNNNNNNNNNNNNNNNN\n"
            ">peanuts2 2LAEKaq\n"
            "SSSSSSSSSSS\n"
            ">OIL3 3hkasUGSV\n"
            "ppppppppppppppppppppp\n"
            "ppppppppppppppppppppp")

# re.M makes ^ match at the start of every line, so each header line is a split point.
res = [re.sub(r"\s+", "", part) for part in re.split(pattern, s, flags=re.M) if part]
print(res)

Output (formatted a bit)

[
  'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
  'SSSSSSSSSSS',
  'pppppppppppppppppppppppppppppppppppppppppp'
]

See a Python demo.
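
If the data is read from a file rather than held in one string, the same grouping can also be done without regular expressions by walking the lines one at a time. The following is only a sketch under that assumption: join_blocks is a hypothetical helper name, and io.StringIO stands in for an open file handle so the example is self-contained.

```python
import io

def join_blocks(lines, marker=">"):
    # Collect the lines that follow each header line (a line starting with
    # `marker`) and join them into a single string per block.
    blocks = []
    current = None  # stays None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(marker):
            if current:  # close the previous block if it has content
                blocks.append("".join(current))
            current = []
        elif line and current is not None:
            current.append(line)
    if current:  # the last block has no header line after it
        blocks.append("".join(current))
    return blocks

# io.StringIO mimics a file object; with a real file you could pass the
# handle returned by open(...) instead.
fake_file = io.StringIO(
    ">rice1 1ALBRGHAER\nNNN\nNNN\n"
    ">peanuts2 2LAEKaq\nSSS\n"
    ">OIL3 3hkasUGSV\nppp\nppp\n"
)
result = join_blocks(fake_file)
print(result)
```

This version handles the final block explicitly after the loop, which is the case the question was concerned about, and it never loads more than one line at a time.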
