Can I use regex to replace all keywords in a string? (Python)

Here is my code:


import re

# case 1
content = "staging_datastorage"
query_term = "st ta ag"

# case 2
# content = "game_event"
# query_term = "gam ame"

terms = re.findall('[a-z0-9]+', query_term, re.I)
terms.sort(key=len, reverse=True)
term_regex = "|".join(terms)
replace_content = re.sub(rf"({term_regex})", r'<em>\1</em>', content, flags = re.I)
print(replace_content)

What I want to do is use the <em> HTML tag to highlight some keywords in a table (called content) based on my input string (query_term). The input string contains the keywords I want to highlight, separated by spaces.

For the two cases, the results I want are:

case 1:
this is better
<em>stag</em>ing_da<em>tast</em>or<em>ag</em>e
this is also fine(nesting highlight tag): 
<em>s<em>t<em></em>a</em>g</em>ing_da<em>ta<em></em>st</em>or<em>ag</em>e

case 2:
perfect result: 
<em>game</em>_event
fine result: 
<em>g<em>am</em>e</em>_event

My code has a bug: for case 2, it only highlights gam and not ame, so the result is wrong: <em>gam</em>e_event

I think this situation is a bit complicated: one keyword may be nested inside another, or one keyword may overlap the beginning (or end) of another.

Can I use regex to solve this?

As I said in the comments, regex searches are non-overlapping: each new match is looked for only in the part of the string that remains after the previous match.
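A quick check makes this visible. This is only an illustration with the case 2 keywords; the lookahead variant at the end is not part of the ideas below, it just shows that the second keyword really is present at an overlapping position.

```python
import re

content = "game_event"

# A plain alternation consumes "gam" first, so "ame" (which starts
# inside that match) is never found.
plain = re.findall(r"gam|ame", content)
print(plain)        # ['gam']

# A zero-width lookahead consumes nothing, so every start position is tried.
overlapping = re.findall(r"(?=(gam|ame))", content)
print(overlapping)  # ['gam', 'ame']
```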

Here is what you can do. Idea #1:

Run re.sub for each keyword separately, in a loop.

Of course, if the matches overlap, some <em> or </em> tags may already be in the way. Here, for example, ame won't match am</em>e. So you need to modify the single-keyword regexes: include (?:</?em>)? between the letters.
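To make the pattern concrete, this is what the join produces for the keyword ame, checked against the string as it looks after gam has already been wrapped. The variable names here are just for illustration.

```python
import re

# An optional opening or closing em tag is allowed between each pair of letters.
pattern = "(?:</?em>)?".join("ame")
print(pattern)  # a(?:</?em>)?m(?:</?em>)?e

already_tagged = "<em>gam</em>e_event"

# The literal keyword no longer matches, because "</em>" sits between "m" and "e".
literal_hit = re.search("ame", already_tagged)

# The tag-tolerant pattern still finds it.
tolerant_hit = re.search(pattern, already_tagged, flags=re.I)
print(tolerant_hit.group(0))  # am</em>e
```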

terms = re.findall('[a-z0-9]+', query_term, re.I)
terms.sort(key=len, reverse=True)
replace_content = content
for term in terms:
    term_regex = "(?:</?em>)?".join(term)
    replace_content = re.sub(rf"({term_regex})", r'<em>\1</em>', replace_content, flags = re.I)

print(replace_content)

Results for both cases:

<em>s<em>t</em><em>a</em>g</em>ing_da<em>ta</em><em>st</em>or<em>ag</em>e
<em>g<em>am</em>e</em>_event

Idea #2

You could pre-process the keywords themselves: find which suffixes of one keyword match a prefix of another, and merge those pairs into additional keywords.

Here: gam has the suffix am, and ame has the prefix am, so you add game to your terms.

This idea would give that "perfect result".
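A minimal sketch of that pre-processing, assuming a single pass over pairs of distinct keywords. The helper names merge_overlapping and highlight_merged_terms are made up for this example. A single pass is enough for case 2; a chain of three or more overlapping keywords (as in case 1) would need the merge repeated, which this sketch does not do.

```python
import re

def merge_overlapping(terms):
    """Return the terms plus one merged keyword per suffix/prefix overlap."""
    merged = set(terms)
    for a in terms:
        for b in terms:
            if a == b:
                continue
            # Try the longest overlap first: a's last k chars equal b's first k chars.
            for k in range(min(len(a), len(b)) - 1, 0, -1):
                if a[-k:].lower() == b[:k].lower():
                    merged.add(a + b[k:])
                    break
    # Longest first, so the merged keyword wins in the alternation.
    return sorted(merged, key=len, reverse=True)

def highlight_merged_terms(content, query_term):
    terms = merge_overlapping(re.findall(r"[a-z0-9]+", query_term, re.I))
    term_regex = "|".join(map(re.escape, terms))
    return re.sub(rf"({term_regex})", r"<em>\1</em>", content, flags=re.I)

print(highlight_merged_terms("game_event", "gam ame"))  # <em>game</em>_event
```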


Idea #3

Apply idea #1, then remove nested highlights and merge those that sit right next to each other (i.e. remove </em><em>).

This idea would give that "perfect result" as well.

To remove one level of nesting, do:

re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", replace_content, flags = re.I)

The regex works by finding tags in the order of <em> <em> </em> </em> (so nested) with any groups of characters without / between them (a quick way to make sure we're taking only the nearest closing tag).

Obviously, since each pass removes only one level of nesting, we need to run this in a loop as well. This would be a while loop: keep replacing as long as the result differs from the previous one, and stop once a pass makes no more changes.

final_result = ""
while final_result != replace_content:
    final_result = replace_content
    replace_content = re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", final_result, flags = re.I)

print(final_result)

Case 2 needs only one replacement, so let's see how this works on case 1:

<em>stag</em>ing_da<em>ta</em><em>st</em>or<em>ag</em>e

Now this only needs the </em><em> removal, as I mentioned!

Final piece of code to put after the idea #1 code:

final_result = ""
while final_result != replace_content:
    final_result = replace_content
    replace_content = re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", final_result, flags = re.I)

final_result = final_result.replace("</em><em>", "")
print(final_result)

Gives:

<em>stag</em>ing_da<em>tast</em>or<em>ag</em>e
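For convenience, here are ideas #1 and #3 put together in one self-contained function. The name highlight_keywords is made up for this sketch. It assumes content contains no / characters and no existing <em> tags, and it can misbehave for keywords such as e, m or em, which would also match inside tags inserted earlier.

```python
import re

# Finds <em> <em> </em> </em> with no "/" in between (one level of nesting).
NESTED = re.compile(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", re.I)

def highlight_keywords(content, query_term):
    terms = re.findall(r"[a-z0-9]+", query_term, re.I)
    terms.sort(key=len, reverse=True)

    # Idea #1: wrap each keyword separately, tolerating tags added earlier.
    result = content
    for term in terms:
        term_regex = "(?:</?em>)?".join(term)
        result = re.sub(rf"({term_regex})", r"<em>\1</em>", result, flags=re.I)

    # Idea #3: flatten one level of nesting per pass until nothing changes.
    previous = None
    while previous != result:
        previous = result
        result = NESTED.sub(r"<em>\1\2\3</em>", result)

    # Merge highlights that ended up directly next to each other.
    return result.replace("</em><em>", "")

print(highlight_keywords("staging_datastorage", "st ta ag"))
# <em>stag</em>ing_da<em>tast</em>or<em>ag</em>e
print(highlight_keywords("game_event", "gam ame"))
# <em>game</em>_event
```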
