Converting Perl Regular Expressions to Python Regular Expressions

I'm having trouble converting a Perl regex to Python. The text I'm trying to match has the following pattern:

Author(s)    : Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname

In Perl I was able to match this and extract the authors with

/Author\(s\)    :((.+\n)+?)/

When I try

re.compile(r'Author\(s\)    :((.+\n)+?)')

in Python, it matches the first author twice and ignores the rest.

Can anyone explain what I am doing wrong here?

You can do this:

# find lines with authors
import re

# multiline string to simulate possible input
text = '''
Stuff before
This won't be matched...
Author(s)    : Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname
Other(s)     : Something else we won't match
               More shenanigans....
Only the author names will be matched.
'''

# run the regex to pull author lines from the sample input
authors = re.search(r'Author\(s\)\s*:\s*(.*?)^[^\s]', text, re.DOTALL | re.MULTILINE).group(1)

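The regex above matches the leading label (Author(s), optional whitespace, a colon, optional whitespace) and then lazily captures everything up to the next line that starts with a non-whitespace character. In effect it captures the continuation lines that begin with whitespace, which gives you the result below: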

'''Firstname Lastname  
           Firstname Lastname  
           Firstname Lastname  
           Firstname Lastname
'''
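Both flags matter for that pattern. As a small sketch (the shortened sample text and the author names are made up for illustration), dropping either flag makes the search fail: without re.DOTALL the dot cannot cross a newline, and without re.MULTILINE the ^ only matches at the very start of the string.

```python
import re

# Shortened stand-in for the sample input; the names are made up.
text = (
    "Author(s)    : Ann Alpha\n"
    "               Bob Beta\n"
    "Other(s)     : Something else\n"
)
pattern = r'Author\(s\)\s*:\s*(.*?)^[^\s]'

# DOTALL lets "." match newlines; MULTILINE lets "^" match after each newline.
both = re.search(pattern, text, re.DOTALL | re.MULTILINE)
no_dotall = re.search(pattern, text, re.MULTILINE)
no_multiline = re.search(pattern, text, re.DOTALL)

print(repr(both.group(1)))      # the two author lines, up to "Other(s)"
print(no_dotall, no_multiline)  # None None
```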

You can then use the regex below to extract the individual authors from that result:

# grab authors from the lines
import re
authors = '''Firstname Lastname  
           Firstname Lastname  
           Firstname Lastname  
           Firstname Lastname
'''

# run the regex to pull a list of individual authors from the author lines
authors = re.findall(r'^\s*(.+?)\s*$', authors, re.MULTILINE)

Which gives you the list of authors:

['Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname']

Combined example code:

text = '''
Stuff before
This won't be matched...
Author(s)    : Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname  
               Firstname Lastname
Other(s)     : Something else we won't match
               More shenanigans....
Only the author names will be matched.
'''

import re
stage1 = re.compile(r'Author\(s\)\s*:\s*(.*?)^[^\s]', re.DOTALL | re.MULTILINE)
stage2 = re.compile(r'^\s*(.+?)\s*$', re.MULTILINE)  # raw string, so \s is not treated as a string escape

preliminary = stage1.search(text).group(1)
authors = stage2.findall(preliminary)

Which sets authors to:

['Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname', 'Firstname Lastname']

Success!
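One caveat with the two-stage approach: search() returns None when there is no author block, and the stage-one pattern needs a following line that starts with a non-whitespace character. A possible variation, sketched under those assumptions, is below. The helper name extract_authors, the empty-list fallback, and the (?=^\S|\Z) lookahead are my own additions, not part of the answer above.

```python
import re

# Hypothetical helper; the name and the empty-list fallback are assumptions.
# The lookahead stops at the next line starting with a non-whitespace
# character, or at the end of the text if the author block comes last.
AUTHOR_BLOCK = re.compile(
    r'Author\(s\)\s*:\s*(.*?)(?=^\S|\Z)',
    re.DOTALL | re.MULTILINE,
)

def extract_authors(text):
    """Return the author names as a list, or [] if no block is found."""
    match = AUTHOR_BLOCK.search(text)
    if match is None:
        return []
    # One author per line; drop the indentation and any empty lines.
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]
```

Here splitlines() plus strip() takes the place of the second regex; either way of splitting should work on input shaped like the sample.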

A capturing group holds only one value per match. Even if the group is repeated by a quantifier, you can only access what its last repetition matched. You'll have to match all the names at once and then split them (on newlines, or with a second regex).
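A short sketch of that behaviour, assuming a made-up three-author input and a greedy variant of the question's pattern: the inner group ends up holding only the last line, while the outer group spans the whole block and can be split afterwards.

```python
import re

# Made-up sample; exactly four spaces before the colon, as in the question.
text = (
    "Author(s)    : Ann Alpha\n"
    "               Bob Beta\n"
    "               Cy Gamma\n"
)

m = re.search(r'Author\(s\)    :((.+\n)+)', text)
whole_block = m.group(1)  # everything the outer group spanned
last_line = m.group(2)    # only the final repetition of the inner group

# Split the outer capture into individual names.
authors = [line.strip() for line in whole_block.splitlines()]
print(authors)            # ['Ann Alpha', 'Bob Beta', 'Cy Gamma']
print(last_line.strip())  # Cy Gamma
```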

Try

re.compile(r'Author\(s\)    :((.+\n)+)')

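In your original expression, the +? makes the repetition non-greedy, i.e. it matches as few lines as possible. Since nothing follows the group in the pattern, a single line is enough for the match to succeed, which is why you only get the first author.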
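To see the difference side by side, here is a minimal comparison on a made-up input in which a blank line ends the author block (that blank line is an assumption of this sketch). With the lazy quantifier both groups hold the first line, which matches the "first author twice" symptom from the question.

```python
import re

# Made-up sample; a blank line separates the author block from what follows.
text = (
    "Author(s)    : Ann Alpha\n"
    "               Bob Beta\n"
    "\n"
    "Title        : Something\n"
)

lazy = re.search(r'Author\(s\)    :((.+\n)+?)', text)
greedy = re.search(r'Author\(s\)    :((.+\n)+)', text)

# Lazy: one repetition is enough, so both groups contain the first line.
print(repr(lazy.group(1)), repr(lazy.group(2)))

# Greedy: keeps consuming non-empty lines and stops at the blank line.
print(greedy.group(1).splitlines())
```

Note that the greedy version keeps going through every following non-empty line, so if another field comes directly after the authors with no blank line in between, it will be captured as well.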
