
Regex to remove all punctuation and anything enclosed by brackets

I'm trying to remove all punctuation and anything inside brackets or parentheses from a string in Python. The idea is to normalize song names somewhat, to get better results when I query the MusicBrainz WebService.

Sample input: TNT (live) [nyc]

Expected output: TNT

I can do it in two regexes, but I would like to see if it can be done in just one. I tried the following, which didn't work...

>>> re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', 'T.N.T. (live) [nyc]')
'T N T live nyc '

If I split the \W+ into its own regex and run it second, I get the expected result, so it seems that \W+ is eating the brackets and parens before the first two alternatives can deal with them.

You are correct that the \W+ is eating the brackets. Remove the + so that \W matches only one character at a time, and you should be set:

>>> re.sub(r'\[.*?\]|\(.*?\)|\W', ' ', 'T.N.T. (live) [nyc]')
'T N T     '

Here's a mini-parser I wrote as an exercise that does the same thing. If your normalization effort gets much more complex, you may want to start looking at parser-based solutions.

# Remove all non-letter chars and anything between parens or brackets

def consume(I):
   I = iter(I)
   lookbehind = None

   def killuntil(returnchar):
      # Discard characters up to and including returnchar;
      # an unclosed bracket simply consumes the rest of the input.
      for ch in I:
         if ch == returnchar:
            return

   for i in I:
      if i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ':
         yield i
         lookbehind = i
      elif not i.strip() and lookbehind != ' ':
         yield ' '
         lookbehind = ' '
      elif i == '(':
         killuntil(')')
      elif i == '[':
         killuntil(']')
      elif lookbehind != ' ':
         # Any other character (punctuation, digits) becomes a single space.
         lookbehind = ' '
         yield ' '

s = "T.N.T. (live) [nyc]"
c = consume(s)
print(repr(''.join(c)))  # 'T N T '

The Python 2 re documentation describes \W as follows:

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_].

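So try r'\[.*?\]|\(.*?\)|{.*?}|[^a-zA-Z0-9_()[\]{}]+' instead. The final character class excludes parentheses, brackets, and braces, so the + can no longer swallow them.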
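As a rough comparison, the sketch below runs that explicit character class next to the plain \W variant. It assumes Python 3, where \W on a str pattern is Unicode-aware by default, so the two differ on accented letters. The squeeze helper and the sample strings are made up for illustration.

```python
import re

explicit = re.compile(r'\[.*?\]|\(.*?\)|{.*?}|[^a-zA-Z0-9_()[\]{}]+')
shorthand = re.compile(r'\[.*?\]|\(.*?\)|\W')

def squeeze(pattern, text):
    """Hypothetical helper: apply a pattern, then collapse runs of spaces."""
    return ' '.join(pattern.sub(' ', text).split())

print(squeeze(explicit, 'T.N.T. (live) {demo} [nyc]'))  # T N T
print(squeeze(explicit, 'Caf\u00e9 (live)'))            # Caf
print(squeeze(shorthand, 'Caf\u00e9 (live)'))           # Café
```

The explicit class treats only ASCII letters, digits, and the underscore as word characters, which may or may not be what you want for song titles.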

Andrew's solution is probably better, though.

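The \W+ eats the brackets because it matches a whole run of non-word characters at once. It starts at the dot after the second T and continues through the space and the opening parenthesis, consuming ". (" in one match. After "live" it matches another run, from the closing parenthesis to the opening bracket: ") [". By then the \(.*?\) and \[.*?\] alternatives have nothing left to match.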
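To see those runs directly, re.findall can list what each version of the pattern actually matches. This is only an illustrative check on the sample string from the question.

```python
import re

text = 'T.N.T. (live) [nyc]'

# With \W+, a run of punctuation swallows the opening delimiters.
greedy = re.findall(r'\[.*?\]|\(.*?\)|\W+', text)
# With a single \W, the bracketed alternatives get their chance.
single = re.findall(r'\[.*?\]|\(.*?\)|\W', text)

print(greedy)  # ['.', '.', '. (', ') [', ']']
print(single)  # ['.', '.', '.', ' ', '(live)', ' ', '[nyc]']
```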
