Regex to remove all punctuation and anything enclosed by brackets
I'm trying to remove all punctuation and anything inside brackets or parentheses from a string in Python. The idea is to somewhat normalize song names to get better results when I query the MusicBrainz WebService.
Sample input: TNT (live) [nyc]
Expected output: TNT
I can do it in two regexes, but I would like to see if it can be done in just one. I tried the following, which didn't work:
>>> re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', 'T.N.T. (live) [nyc]')
'T N T live nyc '
If I split the \W+ into its own regex and run it second, I get the expected result, so it seems that \W+ is eating the brackets and parens before the first two alternatives can deal with them.
You are correct that the \W+ is eating the brackets. Remove the + and you should be set:
>>> re.sub(r'\[.*?\]|\(.*?\)|\W', ' ', 'T.N.T. (live) [nyc]')
'T N T     '
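For anyone who wants to compare the two patterns side by side, here is a minimal sketch. The helper names (`strip_noise`, `collapse`) are my own and not part of the question. Note that the single-character version replaces each punctuation mark and each bracketed group separately, so it leaves runs of spaces that probably need collapsing afterwards.

```python
import re

# The two patterns from this thread; the constant names are illustrative only.
GREEDY = r'\[.*?\]|\(.*?\)|\W+'   # original attempt
SINGLE = r'\[.*?\]|\(.*?\)|\W'    # suggested fix: drop the +


def strip_noise(pattern, title):
    """Replace every match of pattern in title with a single space."""
    return re.sub(pattern, ' ', title)


def collapse(text):
    """Collapse runs of whitespace and trim both ends."""
    return ' '.join(text.split())


title = 'T.N.T. (live) [nyc]'
greedy_result = strip_noise(GREEDY, title)   # bracket contents survive
single_result = strip_noise(SINGLE, title)   # bracket contents removed
normalized = collapse(single_result)         # 'T N T'
```

The split/join step is an assumption about what "normalized" should mean here; drop it if the extra spaces don't matter for your query.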
Here's a mini-parser that does the same thing, which I wrote as an exercise. If your normalization effort gets much more complex, you may want to start looking at parser-based solutions.
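# Keep ASCII letters; drop everything else and anything between parens or brackets
import string

def consume(text):
    chars = iter(text)
    lookbehind = None

    def killuntil(returnchar):
        # Discard characters up to and including returnchar
        # (stops quietly at end of input if the bracket is never closed)
        for ch in chars:
            if ch == returnchar:
                return

    for ch in chars:
        if ch in string.ascii_letters:
            yield ch
            lookbehind = ch
        elif not ch.strip() and lookbehind != ' ':
            # Whitespace: emit a single space, never two in a row
            yield ' '
            lookbehind = ' '
        elif ch == '(':
            killuntil(')')
        elif ch == '[':
            killuntil(']')
        elif lookbehind != ' ':
            # Any other character is treated as a separator
            lookbehind = ' '
            yield ' '

s = "T.N.T. (live) [nyc]"
c = consume(s)
print(''.join(c))  # prints 'T N T '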
From the re documentation on \W:
When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_].
So try r'\[.*?\]|\(.*?\)|{.*?}|[^a-zA-Z0-9_()[\]{}]+'.
Andrew's solution is probably better, though.
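A quick sketch for checking the explicit character class above. Because the last alternative excludes the bracket characters, a run of punctuation stops before an opening bracket, so the bracketed alternatives get their chance to match. The variable names and the second sample title are mine.

```python
import re

# Pattern from the answer above: brackets, parens, and braces are excluded
# from the "everything else" class, so a punctuation run cannot swallow them.
EXPLICIT = r'\[.*?\]|\(.*?\)|{.*?}|[^a-zA-Z0-9_()[\]{}]+'

raw = re.sub(EXPLICIT, ' ', 'T.N.T. (live) [nyc]')
cleaned = ' '.join(raw.split())          # collapse leftover spaces

# A made-up title to exercise the curly-brace alternative as well.
braces = ' '.join(re.sub(EXPLICIT, ' ', 'Song {demo} (live)').split())
```

Here `cleaned` comes out as 'T N T' and `braces` as 'Song'. Underscores and digits are kept, since the class mirrors the ASCII definition of \W.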
The \W+ eats the brackets because it "has a run": it starts matching at the dot after the second T and keeps matching up to and including the opening parenthesis, consuming ". (". After that, it starts matching again at the closing parenthesis and runs through to the opening bracket, consuming ") [". By the time the regex engine reaches each opening bracket, it has already been consumed, so the first two alternatives never get a chance to match.
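You can watch this happen by listing what each match actually consumed. This is just a small illustration using re.finditer on the pattern from the question; the variable names are mine.

```python
import re

# Original pattern from the question, with the greedy \W+ alternative.
pattern = re.compile(r'\[.*?\]|\(.*?\)|\W+')

# Collect the text consumed by each successive match.
matches = [m.group(0) for m in pattern.finditer('T.N.T. (live) [nyc]')]
# matches == ['.', '.', '. (', ') [', ']']
```

The third and fourth matches show the run swallowing the opening "(" and "[", which is why "live" and "nyc" survive the substitution.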