
Extracting contents of a string within parentheses

I have the following string:

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

I would like to create a list of tuples in the form of [(actor_name, character_name),...] like so:

[(Will Ferrell, Nick Halsey), (Rebecca Hall, Samantha), (Michael Pena, Frank Garcia)]

I am currently using a hack-ish way to do this, by splitting on the ( character and then using .rstrip(')'), like so:

for item in string.split(','):
    item.rstrip(')').split('(')

Is there a better, more robust way to do this? Thank you.

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

import re
pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

lst = [(t[0].strip(), t[1].strip()) for t in pat.findall(string)]

The compiled pattern is a bit tricky. It's a raw string, so the backslashes don't have to be doubled. Read left to right, it means: start a match group; match anything that isn't a '(' character, at least once; close the match group; match any white space (including none); match a literal '(' character; start another match group; match anything that isn't a ')' character, at least once; close the match group; match a literal ')' character; match any white space (including none); then something really tricky. The really tricky part is a grouping that doesn't form a match group. Instead of starting with '(' it starts with "(?:", and it still ends with ')'. I used this grouping so I could put a vertical bar in to allow two alternative patterns: either a comma followed by any amount of white space, or else the end of the string (the '$' character).
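To make the point about the non-capturing group concrete, here is a small sketch comparing the pattern above with a variant whose last group does capture. The name `capturing` and that variant are mine, added only for comparison; they are not part of the answer's solution.

```python
import re

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"

# The pattern from the answer: the trailing group is non-capturing, (?: ... ).
non_capturing = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

# Hypothetical variant for comparison: identical, except the trailing group captures.
capturing = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(,\s*|$)')

first_nc = non_capturing.findall(string)[0]
first_c = capturing.findall(string)[0]

print(first_nc)  # ('Will Ferrell ', 'Nick Halsey')
print(first_c)   # ('Will Ferrell ', 'Nick Halsey', ', ')
```

With the capturing variant, every tuple picks up the separator as a third element, which is why the "(?:" form is the convenient one here.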

Then I used pat.findall() to find all the places within string that the pattern matches; because the pattern has two match groups, it returns a list of 2-tuples. I put that in a list comprehension and called .strip() on each item to clean off white space.
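As a quick check of why the .strip() calls are needed, this sketch prints what findall() returns before and after stripping. The variable names `raw`, `actor`, and `character` are just illustrative.

```python
import re

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"
pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

# [^(]+ is greedy, so the space before each '(' ends up inside the first group.
raw = pat.findall(string)
print(raw)
# [('Will Ferrell ', 'Nick Halsey'), ('Rebecca Hall ', 'Samantha'), ('Michael Pena ', 'Frank Garcia')]

# Stripping each item removes that leftover white space.
lst = [(actor.strip(), character.strip()) for actor, character in raw]
print(lst)
# [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Michael Pena', 'Frank Garcia')]
```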

Of course, we can just make the regular expression even more complicated and have it return names that already have white space stripped off. The regular expression gets really hairy, though, so we will use one of the coolest features in Python regular expressions: "verbose" mode, where you can sprawl a pattern across many lines and put comments as you like. We are using a raw triple-quote string so the backslashes are convenient and the multiple lines are convenient. Here you go:

import re
s_pat = r'''
\s*  # any amount of white space
([^( \t]  # start match group; match one char that is not a '(' or space or tab
[^(]*  # match any number of non '(' characters
[^( \t])  # match one char that is not a '(' or space or tab; close match group
\s*  # any amount of white space
\(  # match an actual required '(' char (not in any match group)
\s*  # any amount of white space
([^) \t]  # start match group; match one char that is not a ')' or space or tab
[^)]*  # match any number of non ')' characters
[^) \t])  # match one char that is not a ')' or space or tab; close match group
\s*  # any amount of white space
\) # match an actual required ')' char (not in any match group)
\s*  # any amount of white space
(?:,|$)  # non-match group: either a comma or the end of a line
'''
pat = re.compile(s_pat, re.VERBOSE)

lst = pat.findall(string)

Man, that really wasn't worth the effort.
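For anyone who wants to see what it buys you, here is a rough check. To keep it short I have squeezed the same pattern onto one line without re.VERBOSE; `compact_pat` and the `padded` test string are my own illustrative names, and the one-liner is meant to be equivalent to the verbose pattern, not a replacement for it.

```python
import re

# One-line form of the verbose pattern: each group must start and end
# with a character that is not a space or tab, so the edges come out trimmed.
compact_pat = re.compile(
    r'\s*([^( \t][^(]*[^( \t])\s*\(\s*([^) \t][^)]*[^) \t])\s*\)\s*(?:,|$)'
)

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"
lst = compact_pat.findall(string)
print(lst)
# [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Michael Pena', 'Frank Garcia')]

# Extra spaces around the names and parentheses are trimmed as well.
padded = "  Will Ferrell  ( Nick Halsey ) , Rebecca Hall (Samantha)"
trimmed = compact_pat.findall(padded)
print(trimmed)
# [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha')]
```

One side effect of requiring a distinct first and last character is that each name must be at least two characters long.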

Also, the above preserves the white space inside the names. You could easily normalize the white space, to make sure it is 100% consistent, by splitting on white space and rejoining with spaces.

string = '  Will   Ferrell  ( Nick\tHalsey ) , Rebecca Hall (Samantha), Michael\fPena (Frank Garcia)'

import re
pat = re.compile(r'([^(]+)\s*\(([^)]+)\)\s*(?:,\s*|$)')

def nws(s):
    """normalize white space.  Replaces all runs of white space by a single space."""
    return " ".join(s.split())

lst = [tuple(nws(item) for item in t) for t in pat.findall(string)]

print(lst)  # prints: [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Michael Pena', 'Frank Garcia')]

Now the string has silly white space: multiple spaces, a tab, and even a form feed ("\f") in it. The above cleans it up so that names are separated by a single space.

A good place for regular expressions:

>>> import re
>>> pat = r"([^,\(]*)\((.*?)\)"
>>> re.findall(pat, "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)")
[('Will Ferrell ', 'Nick Halsey'), (' Rebecca Hall ', 'Samantha'), (' Michael Pena ', 'Frank Garcia')]
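Note that the names in that output still carry leading and trailing spaces. A minimal sketch of one way to clean them up, assuming a plain .strip() on each item is acceptable (the names `pairs`, `actor`, and `character` are illustrative):

```python
import re

string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Michael Pena (Frank Garcia)"
pat = r"([^,(]*)\((.*?)\)"

# Strip the white space that the first group picks up around each actor name.
pairs = [(actor.strip(), character.strip())
         for actor, character in re.findall(pat, string)]
print(pairs)
# [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Michael Pena', 'Frank Garcia')]
```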

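A slightly more explicit answer than the others; I think it meets your needs:

import re
# Each name is one or more space-separated words of ASCII letters,
# so a one-word name such as "Samantha" matches too.
name = r'[a-zA-Z]+(?: [a-zA-Z]+)*'
regex = re.compile(r'(' + name + r') \((' + name + r')\)')
actor_character = regex.findall(string)  # string is the one from the question

I'll admit it's a little ugly, but like I said, it's more explicit.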

