I have text that looks like this:
bla bla bla
bla some on wanted text....
****************************************************************************
List of 12 base pairs
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
****************************************************************************
another unwanted text ...
another unwanted text
What I want to do is extract the section that starts with "List of xxx base pairs" and ends at the first line of ***** that follows it.
There are cases where this section does not appear at all. If that happens, the output should be just "NONE".
How can I do that with Python?
I tried this but failed; it prints no output at all.
import sys
import re

def main():
    """docstring for main"""
    infile = "myfile.txt"
    if len(sys.argv) > 1:
        infile = sys.argv[1]
    regex = re.compile(r"""List of (\d+) base pairs$""",re.VERBOSE)
    with open(infile, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')
        for row in tabreader:
            if row:
                line = row[0]
                match = regex.match(line)
                if match:
                    print line

if __name__ == '__main__':
    main()
At the end of the code I was hoping it would just print this:
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
Or simply
NONE
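There are a couple of problems. The regex is too restrictive: re.VERBOSE makes it ignore the unescaped spaces in the pattern, and match() together with $ requires the header to span the whole line. The loop doesn't treat the regex match as the starting point. And there isn't an early exit at the ******* endpoint.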
Here's some working code to get you started:
import re
text = '''
bla bla bla
bla some on wanted text....
****************************************************************************
List of 12 base pairs
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
****************************************************************************
another unwanted text ...
another unwanted text
'''
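regex = re.compile(r"List of (\d+) base pairs")
started = False
for line in text.splitlines():
    if started:
        # Stop at the first row of asterisks after the header
        if line.startswith('*******'):
            break
        print(line)
    elif regex.search(line):
        # Header found: start printing from the next line
        started = True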
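The loop above can also be wrapped in a function so that the "NONE" case from the question is covered. This is only a minimal sketch under a few assumptions: the names `extract_section`, `HEADER` and `sample` are made up for illustration, the sample text is shortened, and it is written for Python 3.

```python
import re

# Hypothetical name; the pattern itself is the one used in the loop above.
HEADER = re.compile(r"List of \d+ base pairs")

def extract_section(lines):
    """Return the lines between the header and the next row of asterisks,
    or None if the header never appears."""
    section = []
    started = False
    for line in lines:
        line = line.rstrip("\n")
        if started:
            if line.startswith("*****"):
                break
            section.append(line)
        elif HEADER.search(line):
            started = True
    return section if started else None

# Shortened illustrative input, not the full file from the question.
sample = """bla bla bla
****************
List of 12 base pairs
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
****************
another unwanted text
"""

result = extract_section(sample.splitlines())
print("\n".join(result) if result is not None else "NONE")
```

Because the function accepts any iterable of lines, an open file object could be passed instead of `sample.splitlines()`. If the header is found but no row of asterisks follows, this sketch returns everything up to the end of the input.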
[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})
Try this regex with re.findall. See demo.
https://regex101.com/r/eZ0yP4/20
import re
p = re.compile(r'[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})')
test_str = " bla bla bla \n bla some on wanted text....\n\n****************************************************************************\nList of 12 base pairs\n nt1 nt2 bp name Saenger LW DSSR\n 1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W\n 2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W\n 3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W\n\n****************************************************************************\nanother unwanted text ...\nanother unwanted text "
print(p.findall(test_str))
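Since findall returns a list, an empty list can stand in for the "NONE" case from the question. The following is a rough sketch of that idea; `first_section`, `SECTION` and the two short test strings are illustrative names and data, not part of the answer above.

```python
import re

# Same pattern as above, under a hypothetical name.
SECTION = re.compile(r"[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})")

def first_section(text):
    """Return the first captured section, or the string 'NONE' if there is none."""
    matches = SECTION.findall(text)
    return matches[0] if matches else "NONE"

# Shortened illustrative inputs.
with_section = "junk\n*****\nList of 2 base pairs\nnt1 nt2\n1 Q.C0 Q.G22\n\n*****\nmore junk"
without_section = "junk\n*****\nmore junk"

print(first_section(with_section))
print(first_section(without_section))
```

Note that the lookahead needs a run of at least five asterisks after the section, so a header that is not followed by such a line is treated as no match.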
You could use the MULTILINE and DOTALL flags of the re module.
#!/usr/bin/python
import re

# Read the whole file into one string
with open('myfile.txt', 'r') as fh:
    f = fh.read()

pat = re.compile(r"""
    List\ of\ \d+\ base\ pairs$  # The start of the match
    (.*?)                        # Note ? to make it nongreedy
    ^[*]+$                       # The ending line
    """, re.MULTILINE | re.DOTALL | re.VERBOSE)
mat = pat.search(f)
if mat:
    print(mat.group(1).strip())
else:
    print('NONE')
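The escaped spaces in that pattern matter because of re.VERBOSE, and the same flag appears on the pattern in the question. A small check, with illustrative variable names, shows the difference between escaped and unescaped spaces:

```python
import re

text = "List of 12 base pairs"

# In VERBOSE mode unescaped whitespace is dropped from the pattern,
# so this one behaves like "Listof\d+basepairs".
unescaped = re.compile(r"List of \d+ base pairs", re.VERBOSE)

# Escaped spaces are kept as literal spaces.
escaped = re.compile(r"List\ of\ \d+\ base\ pairs", re.VERBOSE)

print(unescaped.search(text))         # prints None
print(escaped.search(text).group(0))  # prints the matched header text
```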
Notes:
Use ? after .* to make it nongreedy, in case there are multiple lines of stars in the file.
The spaces must be escaped (List\ of\ ...) since re.VERBOSE is used. Otherwise that whitespace would be ignored and no match would be found!
Another regexp that could be tried:
import re
with open('myfile.txt') as fh:
    f = fh.read()
print(''.join(re.findall(r'\s+nt1[^\n]+\n|\s+\d+\sQ\.[^\n]+\n', f, re.M)))
It accepts either lines starting with nt1 or lines starting with a number followed by Q., as in the pattern passed to re.findall. If nothing matches, it prints an empty line rather than NONE.