Can I use regex to replace all keywords in a string? (Python)

Question

Here is my code:


# case 1
content = "staging_datastorage"
query_term = "st ta ag"

# case 2
# content = "game_event"
# query_term = "gam ame"

terms = re.findall('[a-z0-9]+', query_term, re.I)
terms.sort(key=len, reverse=True)
term_regex = "|".join(terms)
replace_content = re.sub(rf"({term_regex})", r'<em>\1</em>', content, flags = re.I)
print(replace_content)

What I want to do is use the  HTML tag to highlight some keywords in a table (called content ) with my input string ( query_term ). The input string contains the keyword I want to highlight and divide by a space .

For the two cases, the results I want are:

case 1:
this is better
<em>stag</em>ing_da<em>tast</em>or<em>ag</em>e
this is also fine(nesting highlight tag): 
<em>s<em>t<em></em>a</em>g</em>ing_da<em>ta<em></em>st</em>or<em>ag</em>e

case 2:
perfect result: 
<em>game</em>_event
fine result: 
<em>g<em>am</em>e</em>_event

My code has a bug: for case 2, it only highlights gam and not nam , this result is not right: game_event

I think that this situation is a bit complicated, where one keyword is nesting in another one or one keyword is the beginning (or ending) part of another one.

Can I use regex to solve this?

Answer 1

As I said in the comments, searches are non-overlapping, next found is in the remaining part.

What you can do idea #1 :

re.sub each keyword separately in a loop.

Of course if the searches are overlapping, you could have some  or  already in the way - like here, ame won't match ame - so you need to modify the single-keyword regexes. Include (?:</?em>)? between letters.

terms = re.findall('[a-z0-9]+', query_term, re.I)
terms.sort(key=len, reverse=True)
replace_content = content
for term in terms:
    term_regex = "(?:</?em>)?".join(term)
    replace_content = re.sub(rf"({term_regex})", r'<em>\1</em>', replace_content, flags = re.I)

print(replace_content)

Results for both cases:

<em>s<em>t</em><em>a</em>g</em>ing_da<em>ta</em><em>st</em>or<em>ag</em>e

<em>g<em>am</em>e</em>_event

Idea #2

You could pre-process the keywords themselves, find which prefixes match the suffixes, and merge those in another keywords.

Here: gam has suffix am , ame has prefix am -> you add game to your terms.

This idea would give that "perfect result"

Idea #3*

Do the idea #1, remove nested highlights and merge those just next to each other (ie remove  ).

This idea would give that "perfect result" as well.

To remove one level of nesting, do:

re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", replace_content, flags = re.I)

The regex works by finding tags in the order of     (so nested) with any groups of characters without / between them (a quick way to make sure we're taking only the nearest closing tag).

Obviously, with only one level of nesting removed, we need to use this in a loop as well - this would be a while loop: while replaces is different from last time, replace again = stops when replace doesn't make changes anymore.

final_result = ""
while final_result != replace_content:
    final_result = replace_content
    replace_content = re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", final_result, flags = re.I)

print(final_result)

Case2 has only one replacement needed, so let's see how it works on case1:

<em>stag</em>ing_da<em>ta</em><em>st</em>or<em>ag</em>e

Now this only needs the  removal, as I mentioned!

Final piece of code to put after idea #1 code:

final_result = ""
while final_result != replace_content:
    final_result = replace_content
    replace_content = re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", final_result, flags = re.I)

final_result = final_result.replace("</em><em>", "")
print(final_result)

Gives:

<em>stag</em>ing_da<em>tast</em>or<em>ag</em>e

Can I use regex to replace all keywords in a string? (Python)

Question

1 answers

solution1
1 ACCPTED 2020-03-20 09:50:17

Can I use regex to replace all keywords in a string? (Python)

Question

1 answers

solution1 1 ACCPTED 2020-03-20 09:50:17

solution1
1 ACCPTED 2020-03-20 09:50:17