
How to remove everything from parens except if it contains given keywords

So I have this piece of code to filter out words from an incoming string:

RemoveWords = "\\b(official|videoclip|clip|video|mix|ft|feat|music|HQ|version|HD|original|extended|unextended|vs|preview|meets|anthem|12\"|4k|audio|rmx|lyrics|lyric|international|1080p)\\b"
result = re.compile(RemoveWords, re.I)

This was kind of a workaround because I just started with Python. Now what would be ideal is the following:

If the parens contain the word 'remix' or 'edit', don't remove the text within the parens. Otherwise, remove everything in the parens, including the parens themselves.

For example, if a title looks like this:

AC/DC - TNT (from Live at River Plate)

Everything between the parens has to be removed.

But if a title looks like this:

AC/DC - TNT (Dj Example Remix)

Don't remove text between parens, because it contains the word remix.

I know how to remove words that match the regex, but I don't know how to keep the text between parens, or how to delete everything between them when it doesn't contain the given words.

I've tried reading up on regex to find out how to limit a match to the text between parens, but I couldn't figure it out, as I'm also new to regex in general.

You can try this:

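import re

# If any of these words appears inside the parentheses, the text is kept.
keep_words = ["remix", "edit"]

s = "AC/DC - T.N.T. (Dj Example Remix)"

# Lowercased words between the first "(" and the first ")".
words = [i.lower() for i in s[s.index("(")+1:s.index(")")].split()]

# Remove the parenthesized text unless one of its words is in keep_words.
# The pattern is a raw string so the backslashes reach the regex engine unchanged.
new_s = re.sub(r"\((.*?)\)", "", s) if not any(i in keep_words for i in words) else s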

Output:

AC/DC - T.N.T. (Dj Example Remix)

In this case, the code retains the parentheses, because a word between them appears in keep_words. However, if s = "AC/DC - T.N.T. (from Live at River Plate)", the output will be:

AC/DC - T.N.T. 
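
One caveat with the snippet above: s.index raises ValueError when the title has no parentheses at all. Here is a minimal sketch of the same approach wrapped in a function that returns such titles unchanged. The name strip_unless_keep is just something I made up for illustration, it is not part of the original answer.

```python
import re

keep_words = ["remix", "edit"]

def strip_unless_keep(s):
    # Hypothetical helper: same logic as above, but safe for titles without parens.
    start = s.find("(")            # -1 if there is no "("
    end = s.find(")", start + 1)   # -1 if there is no ")" after it
    if start == -1 or end == -1:
        return s
    # Lowercased words inside the first pair of parentheses.
    words = [w.lower() for w in s[start + 1:end].split()]
    if any(w in keep_words for w in words):
        return s
    # Note: this removes every parenthesized span, not only the first one.
    return re.sub(r"\((.*?)\)", "", s)

print(strip_unless_keep("AC/DC - T.N.T. (Dj Example Remix)"))
print(strip_unless_keep("AC/DC - TNT"))
```

Like the original, this only inspects the first pair of parentheses, so a title with several pairs may need a per-match check instead.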

Explanation:

For this solution, the code takes the text between the first pair of parentheses, splits it on whitespace, and lowercases each word in the resulting list. The regular expression works like this:

"\(" => the backslash escapes the parenthesis, so this matches a literal opening parenthesis
"(.*?)" => a group that lazily matches as few characters as possible, here everything between the literal \( and \)
"\)" => matches a literal closing parenthesis. It must be escaped with a backslash so that it is not read as the end of a group

If none of the words between the parentheses is in keep_words, re.sub replaces every parenthesized span, parentheses included, with an empty string: "". Otherwise the string is left unchanged.
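
To see what the pieces of that pattern match, here is a small sketch that compares the lazy (.*?) with a greedy (.*) on a title with two pairs of parentheses. The sample string is made up for the demonstration.

```python
import re

# Made-up title with two parenthesized parts.
s = "AC/DC - T.N.T. (from Live at River Plate) [HD] (Dj Example Remix)"

lazy = re.search(r"\((.*?)\)", s)    # stops at the first ")"
greedy = re.search(r"\((.*)\)", s)   # runs on to the last ")"

print(lazy.group(0))    # whole match, parentheses included
print(lazy.group(1))    # only the text captured by the group
print(greedy.group(1))  # spans both pairs of parentheses
```

The lazy version captures only "from Live at River Plate", while the greedy one captures everything from the first "(" to the last ")", which is why the ? matters here.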

A solution using the re.finditer() and re.search() functions:

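import re

titles = 'AC/DC - T.N.T. (from Live at River Plate) AC/DC - T.N.T. (Dj Example Remix)'
result = titles

# Visit every parenthesized span that has no nested parentheses.
for m in re.finditer(r'\([^()]+\)', titles):
    # Remove the span unless it contains "remix" or "edit" as a whole word (case-insensitive).
    if not re.search(r'\b(remix|edit)\b', m.group(), re.I):
        result = re.sub(re.escape(m.group()), '', result)

print(result)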

The output:

AC/DC - T.N.T.  AC/DC - T.N.T. (Dj Example Remix)
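
The same idea can also be done in one pass by giving re.sub a replacement function, which avoids building a second pattern for every match. This is only a sketch: clean_title, KEEP and PARENS are names I picked for illustration, and also swallowing the whitespace before a removed group is my own assumption about the desired result.

```python
import re

# Keywords that protect a parenthesized group from removal (whole words, any case).
KEEP = re.compile(r"\b(remix|edit)\b", re.I)
# A parenthesized group without nested parens, plus any whitespace just before it.
PARENS = re.compile(r"\s*\([^()]*\)")

def clean_title(title):
    # Hypothetical helper: drop every "(...)" group that has no keep-word in it.
    def repl(m):
        # Keep the match untouched if it contains a keyword, else delete it.
        return m.group() if KEEP.search(m.group()) else ""
    return PARENS.sub(repl, title).strip()

print(clean_title("AC/DC - T.N.T. (from Live at River Plate)"))
print(clean_title("AC/DC - TNT (Dj Example Remix)"))
```

Because the leading whitespace is part of the match, removed groups don't leave a double space or a trailing space behind. Nested parentheses are not handled.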
