How to remove everything from parens except if it contains given keywords

So I have this piece of code to filter out words from an incoming string:

RemoveWords = "\\b(official|videoclip|clip|video|mix|ft|feat|music|HQ|version|HD|original|extended|unextended|vs|preview|meets|anthem|12\"|4k|audio|rmx|lyrics|lyric|international|1080p)\\b"
result = re.compile(RemoveWords, re.I)

This was kind of a workaround because I just started with Python. Now what would be ideal is the following:

If the parens contain the words 'remix' or 'edit': don't remove text within parens. Otherwise remove everything from the parens, including the parens themselves.

For example, if a title looks like this:

AC/DC - TNT (from Live at River Plate)

Everything between the parens has to be removed.

But if a title looks like this:

AC/DC - TNT (Dj Example Remix)

Don't remove text between parens, because it contains the word remix.

I know how to remove words that match the regex, but I don't know how to keep the text between parens, or how to delete everything between them if it doesn't contain the given words.

I've tried reading up on regex to find out how to limit a match to the text between parens, but I couldn't figure it out as I'm also new to regex in general.

You can try this:

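import re

keep_words = ["remix", "edit"]

s = "AC/DC - T.N.T. (Dj Example Remix)"

# Lowercased words between the first "(" and the first ")" in s
words = [i.lower() for i in s[s.index("(")+1:s.index(")")].split()]

# Drop the parenthesized text unless one of its words is in keep_words
# (raw string, so the backslashes reach the regex engine unchanged)
new_s = re.sub(r"\((.*?)\)", "", s) if not any(i in keep_words for i in words) else s

print(new_s)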

Output:

AC/DC - T.N.T. (Dj Example Remix)

In this case, the code retains the parentheses, because a word between them appears in keep_words. However, if s = "AC/DC - T.N.T. (from Live at River Plate)", the output (with a trailing space left behind) will be:

AC/DC - T.N.T. 

Explanation:

For this solution, the code takes the content between the parentheses and splits it on whitespace. It then converts every word in that new list to lowercase. The regular expression works like this:

"\(" => an escaped opening parenthesis: matches the first literal "(" in the string
"(.*?)" => a capture group that lazily matches everything up to the next closing parenthesis
"\)" => an escaped closing parenthesis. The backslash is needed so that it is read as a literal ")" and not as the end of a capture group

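The "?" in "(.*?)" matters as soon as a title has more than one pair of parentheses. A small sketch of the difference, using a made-up title that is not from the question:

```python
import re

# Made-up title with two parenthesized groups (illustration only)
title = "Song (Live) Name (Dj Example Remix)"

# Lazy: each match stops at the nearest ")"
lazy = re.sub(r"\((.*?)\)", "", title)
# Greedy: one match runs from the first "(" to the last ")"
greedy = re.sub(r"\((.*)\)", "", title)

print(repr(lazy))    # 'Song  Name '
print(repr(greedy))  # 'Song '
```
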
If a match is found and no item from keep_words appears between the parentheses, the regular expression removes the parentheses and everything between them by substituting an empty string: ""
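
Note that the snippet above only inspects the first pair of parentheses, and s.index("(") raises ValueError when the title has none. One possible way around both, sketched here with a replacement function passed to re.sub, is to decide group by group. The names strip_parens, PAREN_GROUP and KEEP_PATTERN are made up for this sketch, and it matches keep words as whole words anywhere inside the group, which is slightly looser than the whitespace split above:

```python
import re

KEEP_WORDS = ("remix", "edit")  # same keywords as keep_words above

# One (...) group with no nested parentheses, plus any whitespace before it
PAREN_GROUP = re.compile(r"\s*\([^()]*\)")
# Any keep word as a whole word, case-insensitive
KEEP_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KEEP_WORDS)) + r")\b", re.I
)

def strip_parens(title):
    """Hypothetical helper: drop each (...) group that has no keep word."""
    def replace(match):
        # Returning the matched text keeps the group; "" deletes it
        return match.group() if KEEP_PATTERN.search(match.group()) else ""
    return PAREN_GROUP.sub(replace, title)

print(strip_parens("AC/DC - T.N.T. (from Live at River Plate)"))  # AC/DC - T.N.T.
print(strip_parens("AC/DC - T.N.T. (Dj Example Remix)"))  # AC/DC - T.N.T. (Dj Example Remix)
print(strip_parens("AC/DC - T.N.T."))  # AC/DC - T.N.T.
```

Because the pattern also consumes the whitespace in front of a removed group, no trailing space is left behind.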

The solution using re.finditer() and re.search() functions:

import re

titles = 'AC/DC - T.N.T. (from Live at River Plate) AC/DC - T.N.T. (Dj Example Remix)'
result = titles

# Visit every parenthesized group that has no nested parentheses
for m in re.finditer(r'\([^()]+\)', titles):
    # Remove the group unless it contains "remix" or "edit" as a whole word
    if not re.search(r'\b(remix|edit)\b', m.group(), re.I):
        result = re.sub(re.escape(m.group()), '', result)

print(result)

The output:

AC/DC - T.N.T.  AC/DC - T.N.T. (Dj Example Remix)
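
The output above keeps the space that stood before the removed group, so a double space is left behind. If that matters, a follow-up pass could tidy it. A minimal sketch, where tidy_spaces is a made-up helper name:

```python
import re

def tidy_spaces(text):
    """Hypothetical helper: squeeze whitespace runs and trim both ends."""
    return re.sub(r"\s{2,}", " ", text).strip()

result = "AC/DC - T.N.T.  AC/DC - T.N.T. (Dj Example Remix)"
print(tidy_spaces(result))  # AC/DC - T.N.T. AC/DC - T.N.T. (Dj Example Remix)
```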
