How to remove everything from parens except if it contains given keywords
So I have this piece of code to filter out words from an incoming string:
RemoveWords = "\\b(official|videoclip|clip|video|mix|ft|feat|music|HQ|version|HD|original|extended|unextended|vs|preview|meets|anthem|12\"|4k|audio|rmx|lyrics|lyric|international|1080p)\\b"
result = re.compile(RemoveWords, re.I)
This was kind of a workaround because I just started with Python. Now what would be ideal is the following:
If the parens contain the words 'remix' or 'edit': don't remove the text within the parens.
Otherwise, remove everything in the parens, including the parens themselves.
For example, if a title looks like this:
AC/DC - TNT (from Live at River Plate)
Everything between the parens has to be removed.
But if a title looks like this:
AC/DC - TNT (Dj Example Remix)
Don't remove the text between the parens, because it contains the word remix.
I know how to remove words that match the regex, but I don't know how to keep the text between parens, or how to delete everything between them if it doesn't contain the given words.
I've tried reading up on regex to find out how to limit a match to the text between parens, but I couldn't figure it out, as I'm also new to regex in general.
You can try this:
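import re

keep_words = ["remix", "edit"]
s = "AC/DC - T.N.T. (Dj Example Remix)"
# Lowercased words between the first "(" and the first ")" in the title
words = [i.lower() for i in s[s.index("(")+1:s.index(")")].split()]
# Remove the parenthesized part unless one of its words is in keep_words
new_s = re.sub(r"\((.*?)\)", "", s) if not any(i in keep_words for i in words) else s
print(new_s)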
Output:
AC/DC - T.N.T. (Dj Example Remix)
In this case, the code will retain the parentheses, because a word between them appears in keep_words. However, if s = "AC/DC - TNT (from Live at River Plate)", the output will be (with a trailing space left where the parenthesized text was removed):
AC/DC - TNT
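The snippet above calls s.index("("), which raises ValueError when a title has no parentheses at all. As a minimal sketch of one way around that, the same logic can be wrapped in a function that uses re.search() instead. The names KEEP_WORDS and strip_first_parens are hypothetical and not part of the original answer, and stripping the leftover whitespace is an added assumption about the desired result.

```python
import re

# Hypothetical names for illustration; not part of the original answer
KEEP_WORDS = {"remix", "edit"}

def strip_first_parens(title):
    """Drop the first parenthesized part unless it contains a keep word."""
    match = re.search(r"\((.*?)\)", title)
    if match is None:
        return title  # no parentheses: nothing to remove
    # Lowercased words found between the parentheses
    words = {w.lower() for w in match.group(1).split()}
    if words & KEEP_WORDS:
        return title  # a keep word is present: leave the title alone
    # Cut out the matched span, then trim leftover whitespace
    return (title[:match.start()] + title[match.end():]).strip()

print(strip_first_parens("AC/DC - TNT (from Live at River Plate)"))
print(strip_first_parens("AC/DC - TNT (Dj Example Remix)"))
print(strip_first_parens("AC/DC - TNT"))
```

Like the original snippet, this only looks at the first pair of parentheses and only recognizes keep words that are separated by whitespace.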
Explanation:
For this solution, the algorithm finds the content between the parentheses and splits it into words. Then the code converts every word in that new list to lowercase. The regular expression works like this:
"\(" => an escaped opening parenthesis: matches the first literal "(" in the string
"(.*?)" => a capture group that matches as little text as possible between the outer \( and \)
"\)" => an escaped closing parenthesis. The backslash is needed so that it is read as a literal ")" rather than as the end of a capture group
If a match is found and no item from keep_words appears between the parentheses, re.sub() removes the parentheses and everything between them by replacing the match with an empty string: ""
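The "?" after ".*" matters once a title has more than one parenthesized group. The short sketch below compares the pattern from the answer with a greedy variant. The title string is a made-up example, not one from the question.

```python
import re

# Hypothetical title with two parenthesized groups
title = "Artist - Song (Live) (Dj Example Remix)"

# With ".*?" each group is matched on its own
lazy = re.findall(r"\((.*?)\)", title)

# With ".*" a single match runs from the first "(" to the last ")"
greedy = re.findall(r"\((.*)\)", title)

print(lazy)    # ['Live', 'Dj Example Remix']
print(greedy)  # ['Live) (Dj Example Remix']
```

With the greedy form, re.sub() would delete both groups in one go, including the remix credit that should be kept.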
The solution using the re.finditer() and re.search() functions:
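import re

titles = 'AC/DC - T.N.T. (from Live at River Plate) AC/DC - T.N.T. (Dj Example Remix)'
result = titles
# Visit every parenthesized group that contains no nested parentheses
for m in re.finditer(r'\([^()]+\)', titles):
    # Drop the group unless it contains "remix" or "edit" as a whole word
    if not re.search(r'\b(remix|edit)\b', m.group(), re.I):
        result = re.sub(re.escape(m.group()), '', result)
print(result)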
The output:
AC/DC - T.N.T.  AC/DC - T.N.T. (Dj Example Remix)
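The same idea can also be written as a single re.sub() call with a replacement function, which decides for each parenthesized group whether to keep it. This is only a sketch: the names KEEP, clean_title and keep_or_drop are hypothetical, and collapsing the leftover whitespace is an added assumption about what the cleaned title should look like.

```python
import re

# Hypothetical names for illustration; not part of the original answers
KEEP = re.compile(r"\b(remix|edit)\b", re.I)

def clean_title(title):
    """Remove every parenthesized group that lacks a keep word."""
    def keep_or_drop(m):
        # Return the group unchanged if it mentions remix/edit, else drop it
        return m.group() if KEEP.search(m.group()) else ""
    cleaned = re.sub(r"\([^()]*\)", keep_or_drop, title)
    # Collapse runs of whitespace left behind by removed groups
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(clean_title("AC/DC - T.N.T. (from Live at River Plate)"))
print(clean_title("AC/DC - T.N.T. (Dj Example Remix)"))
```

Because the replacement function runs once per match, a title with several groups keeps only the ones that mention remix or edit.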