Regex to remove all punctuation and anything enclosed in brackets

Question

I'm trying to remove all punctuation and anything inside brackets or parentheses from a string in Python. The idea is to somewhat normalize song names to get better results when I query the MusicBrainz WebService.

Sample input: TNT (live) [nyc]

Expected output: TNT

I can do it in two regexes, but I would like to see if it can be done in just one. I tried the following, which didn't work...

>>> re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', 'T.N.T. (live) [nyc]')
'T N T live nyc '

If I split the \W+ into its own regex and run it second, I get the expected result, so it seems that \W+ is eating the brackets and parens before the first two alternatives can deal with them.
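For reference, the two-pass version described above might look like the sketch below. The question doesn't show the exact two regexes, so the patterns, the function name `strip_two_pass`, and the final `.strip()` are assumptions made for illustration.

```python
import re

def strip_two_pass(title):
    # Pass 1: drop anything enclosed in [...] or (...)
    without_groups = re.sub(r'\[.*?\]|\(.*?\)', ' ', title)
    # Pass 2: replace each run of non-word characters with a single space
    spaced = re.sub(r'\W+', ' ', without_groups)
    # Assumption: leading/trailing spaces are not wanted in the result
    return spaced.strip()

print(strip_two_pass('TNT (live) [nyc]'))     # TNT
print(strip_two_pass('T.N.T. (live) [nyc]'))  # T N T
```

Because the bracketed groups are already gone when the second pass runs, \W+ no longer has any brackets to swallow.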

Answer 1

You are correct that the \W+ is eating the brackets. Remove the + and you should be set:

>>> re.sub(r'\[.*?\]|\(.*?\)|\W', ' ', 'T.N.T. (live) [nyc]')
'T N T     '
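The result above still carries runs of spaces. If a tidier string is wanted for the lookup, one option is to collapse the whitespace afterwards. This is only a sketch: the name `normalize_title` is made up here, and collapsing with `split`/`join` is an assumption about what "normalized" should mean.

```python
import re

# Same single-character pattern as in the answer above
PATTERN = re.compile(r'\[.*?\]|\(.*?\)|\W')

def normalize_title(title):
    # Replace bracketed groups and each non-word character with a space
    spaced = PATTERN.sub(' ', title)
    # Collapse repeated spaces and trim both ends
    return ' '.join(spaced.split())

print(normalize_title('T.N.T. (live) [nyc]'))  # T N T
print(normalize_title('TNT (live) [nyc]'))     # TNT
```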

Answer 2

Here's a mini-parser I wrote as an exercise that does the same thing. If your effort to normalize gets much more complex, you may want to start looking at parser-based solutions.

# Remove all non-letter chars and anything between parens or brackets

import string

def consume(I):

   I = iter(I)
   lookbehind = None

   def killuntil(returnchar):
      # Discard input up to and including returnchar;
      # stops quietly if the input ends first
      for ch in I:
         if ch == returnchar:
            return

   for i in I:
      if i in string.ascii_letters:
         yield i
         lookbehind = i
      elif not i.strip() and lookbehind != ' ':
         yield ' '
         lookbehind = ' '
      elif i == '(':
         killuntil(')')
      elif i == '[':
         killuntil(']')
      elif lookbehind != ' ':
         # Any other character (punctuation, digits) becomes a single space
         lookbehind = ' '
         yield ' '

s = "T.N.T. (live) [nyc]"
c = consume(s)
print(repr(''.join(c)))  # 'T N T '

Answer 3

\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_].

So try r'\[.*?\]|\(.*?\)|{.*?}|[^a-zA-Z0-9_()[\]{}]+' .

Andrew's solution is probably better, though.
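To try the pattern from this answer, a small wrapper like the following could be used. The helper name `strip_with_explicit_class` and the whitespace collapsing are assumptions added for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: the negated class also excludes bracket characters,
# so a "+" run stops before "(", "[" or "{"
ANSWER3_PATTERN = re.compile(r'\[.*?\]|\(.*?\)|{.*?}|[^a-zA-Z0-9_()[\]{}]+')

def strip_with_explicit_class(title):  # hypothetical helper name
    spaced = ANSWER3_PATTERN.sub(' ', title)
    # Collapse repeated spaces and trim both ends
    return ' '.join(spaced.split())

print(strip_with_explicit_class('T.N.T. (live) [nyc]'))  # T N T
```

One thing to keep in mind: the explicit class only treats ASCII letters and digits as word characters, so accented letters in song names would be replaced too. In Python 3, \W on a str pattern is Unicode-aware by default and would keep them.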

Answer 4

The \W+ eats the brackets because it "has a run": it starts matching at the dot after the second T and keeps matching up to and including the first parenthesis: ". (". After that, it starts matching again from bracket to bracket: ") [".
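The runs described here can be made visible with re.findall, which lists what each match actually consumed. This is just an illustrative sketch; the variable names are made up.

```python
import re

title = 'T.N.T. (live) [nyc]'

# With \W+, a run that starts at a dot swallows the following "(" as well,
# so the \(.*?\) alternative never gets a chance at that position
greedy_runs = re.findall(r'\[.*?\]|\(.*?\)|\W+', title)
print(greedy_runs)   # ['.', '.', '. (', ') [', ']']

# With a single \W, each match is one character long, so the scan reaches
# "(" and "[" on its own and the bracket alternatives are tried first there
single_chars = re.findall(r'\[.*?\]|\(.*?\)|\W', title)
print(single_chars)  # ['.', '.', '.', ' ', '(live)', ' ', '[nyc]']
```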

Answer 1: score 3, accepted, 2011-05-26 20:10:26

Answer 2: score 1, 2011-05-26 21:00:13

Answer 3: score 0, 2011-05-26 20:07:29

Answer 4: score 0, 2011-05-26 20:08:23
