Here is my code:
import re

# case 1
content = "staging_datastorage"
query_term = "st ta ag"
# case 2
# content = "game_event"
# query_term = "gam ame"
terms = re.findall('[a-z0-9]+', query_term, re.I)
terms.sort(key=len, reverse=True)
term_regex = "|".join(terms)
replace_content = re.sub(rf"({term_regex})", r'<em>\1</em>', content, flags = re.I)
print(replace_content)
What I want to do is use the <em> HTML tag to highlight some keywords in a string (called content, e.g. a table name) based on my input string (query_term). The input string contains the keywords I want to highlight, separated by spaces.
For the two cases, the results I want are:
case 1:
this is better:
<em>stag</em>ing_da<em>tast</em>or<em>ag</em>e
this is also fine (nested highlight tags):
<em>s<em>t<em></em>a</em>g</em>ing_da<em>ta<em></em>st</em>or<em>ag</em>e
case 2:
perfect result:
<em>game</em>_event
fine result:
<em>g<em>am</em>e</em>_event
My code has a bug: for case 2, it only highlights gam and not ame, so the result is not right: <em>gam</em>e_event
I think this situation is a bit complicated: one keyword may be nested inside another, or one keyword may overlap the beginning (or end) of another.
Can I use regex to solve this?
As I said in the comments, regex searches are non-overlapping: each next match is looked for only in the remaining part of the string, after the end of the previous match.
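A minimal sketch of that behaviour, using the case 2 values from the question (the variable names matches and alone are only for illustration):

```python
import re

content = "game_event"

# The scanner resumes after the end of the previous match, so once "gam"
# is consumed only "e_event" is left and "ame" can no longer be found.
matches = re.findall(r"gam|ame", content, flags=re.I)
print(matches)  # ['gam']

# Searched for on its own, "ame" is found, overlapping with "gam".
alone = re.search(r"ame", content, flags=re.I)
print(alone.span())  # (1, 4)
```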
What you can do:
Idea #1
Run re.sub for each keyword separately in a loop.
Of course, if the matches overlap, you could have some <em> or </em> already in the way (like here, ame won't match am</em>e), so you need to modify the single-keyword regexes: include (?:</?em>)? between letters.
terms = re.findall('[a-z0-9]+', query_term, re.I)
terms.sort(key=len, reverse=True)
replace_content = content
for term in terms:
    # allow an optional <em> or </em> between the letters of the keyword
    term_regex = "(?:</?em>)?".join(term)
    replace_content = re.sub(rf"({term_regex})", r'<em>\1</em>', replace_content, flags = re.I)
print(replace_content)
Results for both cases:
<em>s<em>t</em><em>a</em>g</em>ing_da<em>ta</em><em>st</em>or<em>ag</em>e
<em>g<em>am</em>e</em>_event
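To make the per-keyword pattern concrete, here is a small sketch of what the join produces for ame and how it matches text that already contains a tag from an earlier pass. The name partially_highlighted is only for illustration.

```python
import re

term = "ame"
# Allow an optional <em> or </em> between consecutive letters of the term.
term_regex = "(?:</?em>)?".join(term)
print(term_regex)  # a(?:</?em>)?m(?:</?em>)?e

# After "gam" has been wrapped, "ame" is interrupted by a closing tag,
# but the relaxed pattern still finds it.
partially_highlighted = "<em>gam</em>e_event"
match = re.search(term_regex, partially_highlighted, flags=re.I)
print(match.group(0))  # am</em>e
```

One caveat worth checking before relying on this: a keyword such as em (or any keyword containing those letters in sequence) can also match inside a tag inserted by an earlier pass, so the approach assumes keywords that do not collide with the tag name.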
Idea #2
You could pre-process the keywords themselves: find which prefixes match which suffixes, and add the merged strings as additional keywords.
Here: gam has suffix am, ame has prefix am -> you add game to your terms.
This idea would give that "perfect result".
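One possible sketch of this pre-processing, not a definitive implementation: the helper names merge_overlapping and highlight_merged are hypothetical, and keeping only merged keywords that actually occur in content is my own assumption, added so that the merging loop is guaranteed to stop.

```python
import re

def merge_overlapping(terms, content):
    """Add merged keywords (gam + ame -> game) that occur in content."""
    haystack = content.lower()
    merged = {t.lower() for t in terms}
    changed = True
    while changed:  # repeat so that merged keywords can merge again
        changed = False
        for a in list(merged):
            for b in list(merged):
                # k is the length of the overlap: suffix of a == prefix of b
                for k in range(1, min(len(a), len(b))):
                    candidate = a + b[k:]
                    if (a.endswith(b[:k]) and candidate in haystack
                            and candidate not in merged):
                        merged.add(candidate)
                        changed = True
    # longest first, so the alternation prefers the merged keywords
    return sorted(merged, key=lambda t: (-len(t), t))

def highlight_merged(content, query_term):
    terms = re.findall(r"[a-z0-9]+", query_term, re.I)
    pattern = "|".join(map(re.escape, merge_overlapping(terms, content)))
    return re.sub(rf"({pattern})", r"<em>\1</em>", content, flags=re.I)

print(highlight_merged("game_event", "gam ame"))
print(highlight_merged("staging_datastorage", "st ta ag"))
```

For case 2 this prints <em>game</em>_event. For case 1 the overlapping keywords merge into stag, but ta and st are merely adjacent rather than overlapping, so they still come out as <em>ta</em><em>st</em> and would need the </em><em> cleanup as well.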
Idea #3
Do idea #1, then remove nested highlights and merge those that sit right next to each other (i.e. remove </em><em>).
This idea would give that "perfect result" as well.
To remove one level of nesting, do:
re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", replace_content, flags = re.I)
The regex works by finding tags in the order <em> <em> </em> </em> (so nested), with any run of characters other than / between them (a quick way to make sure we only take the nearest closing tag).
Obviously, since each pass removes only one level of nesting, we need to use this in a loop as well. This would be a while loop: keep replacing as long as the result differs from the previous one, and stop once a replacement no longer changes anything.
final_result = ""
while final_result != replace_content:
    final_result = replace_content
    replace_content = re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", final_result, flags = re.I)
print(final_result)
Case 2 needs only one replacement, so let's see how it works on case 1:
<em>stag</em>ing_da<em>ta</em><em>st</em>or<em>ag</em>e
Now this only needs the </em><em> removal, as I mentioned!
Final piece of code to put after the idea #1 code:
final_result = ""
while final_result != replace_content:
    final_result = replace_content
    replace_content = re.sub(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", r"<em>\1\2\3</em>", final_result, flags = re.I)
# merge highlights that sit directly next to each other
final_result = final_result.replace("</em><em>", "")
print(final_result)
Gives:
<em>stag</em>ing_da<em>tast</em>or<em>ag</em>e
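For convenience, the pieces of idea #1 and idea #3 can be put together in one function. This is only a sketch under the same assumptions as above (keywords are plain letters and digits and do not collide with the em tag name); the names highlight_flat and NESTED are hypothetical.

```python
import re

# One level of nesting: <em> ... <em> ... </em> ... </em>
NESTED = re.compile(r"<em>([^/]*)<em>([^/]*)</em>([^/]*)</em>", re.I)

def highlight_flat(content, query_term):
    terms = re.findall(r"[a-z0-9]+", query_term, re.I)
    terms.sort(key=len, reverse=True)

    # Idea #1: highlight each keyword separately, tolerating existing tags.
    result = content
    for term in terms:
        term_regex = "(?:</?em>)?".join(term)
        result = re.sub(rf"({term_regex})", r"<em>\1</em>", result, flags=re.I)

    # Idea #3: strip one nesting level per pass until nothing changes.
    previous = None
    while previous != result:
        previous = result
        result = NESTED.sub(r"<em>\1\2\3</em>", result)

    # Merge highlights that sit directly next to each other.
    return result.replace("</em><em>", "")

print(highlight_flat("staging_datastorage", "st ta ag"))
print(highlight_flat("game_event", "gam ame"))
```

With the two cases from the question this prints <em>stag</em>ing_da<em>tast</em>or<em>ag</em>e and <em>game</em>_event.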