Parsing plain text with section in Python

I have text that looks like this:

    bla bla bla 
    bla some on wanted text....

****************************************************************************
List of 12 base pairs
      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

****************************************************************************
another unwanted text ...
another unwanted text 

What I want to do is extract the section that starts with "List of xxx base pairs" and ends at the first ***** line it encounters.

There are cases where this section does not appear at all. If that happens, it should output just "NONE".

How can I do that with Python?

I tried this, but it failed: it prints no output at all.

import sys
import re

def main():
    """docstring for main"""
    infile = "myfile.txt"
    if len(sys.argv) > 1:
        infile = sys.argv[1]

    regex = re.compile(r"""List of (\d+) base pairs$""",re.VERBOSE)

    with open(infile, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')

        for row in tabreader:
            if row:
                line = row[0]
                match = regex.match(line)
                if match:
                    print line



if __name__ == '__main__':
    main()

At the end of the code I was hoping it would just print this:

      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

Or simply

NONE


There are a couple of problems. The regex is a little too restrictive. The loop doesn't treat the regex match as the starting point of the section. And there isn't an early exit at the ******* endpoint.
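To illustrate the first point: one likely reason the question's pattern never matches is that it is compiled with re.VERBOSE, which makes the regex engine ignore unescaped whitespace inside the pattern, so the spaces in "List of" are effectively dropped. A minimal sketch, where the variable names are made up for illustration:

```python
import re

line = "List of 12 base pairs"

# The question's pattern: under re.VERBOSE the unescaped spaces are ignored,
# so it effectively looks for "Listof<digits>basepairs".
verbose_pat = re.compile(r"""List of (\d+) base pairs$""", re.VERBOSE)

# The same text without re.VERBOSE (and without the trailing $ anchor).
plain_pat = re.compile(r"List of (\d+) base pairs")

verbose_result = verbose_pat.match(line)
plain_result = plain_pat.search(line)

print(verbose_result)         # no match object for the real header line
print(plain_result.group(1))  # the captured count of base pairs
```

Dropping re.VERBOSE, or escaping each space, is enough to make the header line match.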

Here's some working code to get you started:

import re

text = '''
    bla bla bla 
    bla some on wanted text....

****************************************************************************
List of 12 base pairs
      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

****************************************************************************
another unwanted text ...
another unwanted text
'''

regex = re.compile(r"List of (\d+) base pairs")

started = False
for line in text.splitlines():
    if started:
        if line.startswith('*******'):
            break
        print(line)
    elif regex.search(line):
        started = True
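The loop above does not yet cover the "NONE" case from the question. One possible way to handle it is to wrap the same line-scanning logic in a function. This is only a sketch: the function name extract_section, the constant START, and the shortened sample text are assumptions for illustration, not part of the original answer.

```python
import re

START = re.compile(r"List of \d+ base pairs")

def extract_section(text):
    """Return the lines after the 'List of N base pairs' header,
    up to the next line of asterisks, or 'NONE' if there is no header."""
    collected = []
    started = False
    for line in text.splitlines():
        if started:
            if line.startswith("*****"):
                break  # the first line of stars after the header ends the section
            collected.append(line)
        elif START.search(line):
            started = True
    if not started:
        return "NONE"
    # Drop the blank line(s) that precede the closing row of stars.
    return "\n".join(collected).rstrip()

sample = """junk
****
List of 2 base pairs
   nt1   nt2
 1 Q.C0  Q.G22

*****
more junk
"""

print(extract_section(sample))
print(extract_section("nothing relevant here"))
```

To use it on a file, read the file contents into a string first and pass that string to the function.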

[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})

Try this regex with re.findall. See demo.

https://regex101.com/r/eZ0yP4/20

import re
p = re.compile(r'[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})')
test_str = " bla bla bla \n bla some on wanted text....\n\n****************************************************************************\nList of 12 base pairs\n nt1 nt2 bp name Saenger LW DSSR\n 1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W\n 2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W\n 3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W\n\n****************************************************************************\nanother unwanted text ...\nanother unwanted text "

matches = re.findall(p, test_str)
print(matches[0] if matches else 'NONE')
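For readers who want to see what each piece of that pattern does, the same regex can be spelled out with re.VERBOSE comments. This is a sketch with an abbreviated, made-up sample string. The spaces in the header are escaped here because re.VERBOSE would otherwise ignore them.

```python
import re

pattern = re.compile(r"""
    [ ]*List\ of\ \d+\ base\ pairs   # header line, optionally indented
    \n*                              # skip the newline(s) after the header
    ([\s\S]*?)                       # lazy capture; this class also matches newlines
    (?=\n*\*{5,})                    # stop just before a run of 5 or more asterisks
""", re.VERBOSE)

sample = "junk\n*****\nList of 2 base pairs\n nt1 nt2\n 1 Q.C0 Q.G22\n\n*****\nmore junk"

matches = pattern.findall(sample)
result = matches[0] if matches else "NONE"
print(result)
```

Because the lookahead does not consume the asterisks, the captured group ends right after the last table row.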

You could use the MULTILINE and DOTALL flags of the re module.

#!/usr/bin/python

import re

# Read the whole file into one string so the pattern can span lines
with open('myfile.txt', 'r') as infile:
    f = infile.read()

pat = re.compile(r"""
    List\ of\ \d+\ base\ pairs$  # The start of the match
    (.*?)                        # Note ? to make it nongreedy
    ^[*]+$                       # The ending line
    """, re.MULTILINE | re.DOTALL | re.VERBOSE)

mat = pat.search(f)

if mat:
    print(mat.group(1).strip())
else:
    print('NONE')

Notes:

  • You need ? after .* to make it nongreedy in case there are multiple lines of stars in the file.
  • The whitespace in the initial string needs to be escaped ( List\ of\ ... ) since re.VERBOSE is used. Otherwise that whitespace would be ignored and no match would be found!
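The first note can be checked with a small experiment. The sketch below uses a made-up sample containing two lines of stars and compares the nongreedy (.*?) with the greedy (.*). The names lazy, greedy, and sample are illustrative only.

```python
import re

sample = (
    "List of 2 base pairs\n"
    "row one\n"
    "row two\n"
    "*****\n"
    "unwanted\n"
    "*****\n"
)

flags = re.MULTILINE | re.DOTALL
lazy = re.compile(r"List of \d+ base pairs$(.*?)^[*]+$", flags)
greedy = re.compile(r"List of \d+ base pairs$(.*)^[*]+$", flags)

lazy_body = lazy.search(sample).group(1).strip()
greedy_body = greedy.search(sample).group(1).strip()

print(lazy_body)    # stops at the first line of stars
print("---")
print(greedy_body)  # runs on to the last line of stars
```

With the greedy version, the unwanted text between the two lines of stars ends up inside the captured group.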

Another regexp that could be tried:

import re
my_file = 'myfile.txt'  # input path; the file name used in the question
f = open(my_file).read()
print(''.join(re.findall(r'\s+nt1[^\n]+\n|\s+\d+\sQ\.[^\n]+\n', f, re.M)))

It matches either an indented line starting with nt1 or an indented line starting with a number followed by Q., which are the two alternatives in the pattern passed to re.findall.
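As a rough illustration of how this row-based pattern behaves, including the "NONE" case, here is a small sketch. The helper name table_rows and the shortened sample are assumptions for illustration. Unlike the one-liner above, this version strips the leading indentation from each matched row.

```python
import re

ROW_PATTERN = re.compile(r'\s+nt1[^\n]+\n|\s+\d+\sQ\.[^\n]+\n', re.M)

def table_rows(text):
    """Return the header and data rows matched by the pattern, or 'NONE'."""
    rows = [m.strip() for m in ROW_PATTERN.findall(text)]
    return "\n".join(rows) if rows else "NONE"

sample = (
    "List of 2 base pairs\n"
    "      nt1    nt2    bp\n"
    "   1 Q.C0   Q.G22  C-G\n"
    "   2 Q.C1   Q.G21  C-G\n"
    "\n"
    "*****\n"
    "unwanted text\n"
)

print(table_rows(sample))
print(table_rows("no table here\n"))
```

Note that this approach keys on the shape of the rows rather than on the "List of" header or the line of stars, so it would also pick up similarly shaped lines elsewhere in the file.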
