Python Regex, how to substitute multiple occurrences with a single pattern?

I'm trying to make a fuzzy autocomplete suggestion box that highlights the searched characters with HTML <b></b> tags.

For example, if the user types 'ldi' and one of the suggestions is "Leonardo DiCaprio", then the desired outcome is "<b>L</b>eonar<b>d</b>o D<b>i</b>Caprio". The first occurrence of each character is highlighted, in order of appearance.

What I'm doing right now is:

import re

def prototype_finding_chars_in_string():
    test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]
    comp_string = "ldi"  # chars to highlight
    # re.escape keeps characters such as "." or "(" literal; for "ldi" this builds .*?(l).*?(d).*?(i).*?
    regex = ".*?" + ".*?".join([f"({re.escape(x)})" for x in comp_string]) + ".*?"
    regex_compiled = re.compile(regex, re.IGNORECASE)
    for x in test_string_list:
        # keeps only the entries that contain the search chars in order
        re_search_result = regex_compiled.search(x)
        if re_search_result:
            print(f"char combination {comp_string} are in {x} result group: {re_search_result.groups()}")

results in

char combination ldi are in Leonardo DiCaprio result group: ('L', 'D', 'i')

Now I want to replace each occurrence in the result groups with <b>[whatever in the result]</b> and I'm not sure how to do it.

What I'm currently doing is looping over the result and using the built-in str.replace method to replace the occurrences:

def replace_with_bold(result_groups, original_string):
    output_string: str = original_string
    for result in result_groups:
        output_string = output_string.replace(result,f"<b>{result}</b>",1)
    
    return output_string

This results in:

Highlighted string: <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio

But I think looping over the results like this is wasteful when I already have the match groups. Furthermore, it's not even correct, because str.replace searches from the beginning of the string on every iteration. So for the input 'ooo' this is the result:

char combination ooo are in Leonardo DiCaprio result group: ('o', 'o', 'o')
Highlighted string: Le<b><b><b>o</b></b></b>nardo DiCaprio

When it should be Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>

Is there a way to simplify this? Maybe regex here is overkill?

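One option is to keep the loop but replace str.replace with re.sub and lookaheads:

for result in result_groups:
    output_string = re.sub(fr'(.*?(?!<b>))({re.escape(result)})((?!</b>).*)',
                           r'\1<b>\2</b>\3',
                           output_string,
                           flags=re.IGNORECASE)

On each iteration, the first eligible occurrence of result is replaced by <b>result</b>. The ? makes .* lazy, so the match stops at the earliest candidate, and the lookahead (?!</b>) skips an occurrence that is already wrapped in a tag. The leading (?!<b>) is evaluated immediately before result, so for a single letter it has no effect. \1, \2 and \3 refer to the first, second and third groups, and the IGNORECASE flag makes the match case-insensitive. Note that each call still scans from the start of the string, so a letter can be highlighted ahead of the previous one (for 'do', the first o in "Leonardo" is picked rather than the one after the d), and searching for b can match inside an already inserted <b> tag.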
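Since the match object already records where each group matched, another option is to read those offsets with Match.span() and rebuild the string in a single pass, which avoids rescanning from the start. The following is a minimal sketch under that assumption; highlight_match and build_pattern are hypothetical helper names, not part of the code above, and the text is not HTML-escaped.

```python
import re


def build_pattern(chars: str) -> re.Pattern:
    # Same shape as the question's pattern, e.g. (l).*?(d).*?(i).
    # re.escape guards against input such as "." or "(".
    return re.compile(".*?".join(f"({re.escape(c)})" for c in chars), re.IGNORECASE)


def highlight_match(match: re.Match) -> str:
    """Wrap every captured group of `match` in <b> tags, using the group offsets."""
    text = match.string
    pieces = []
    cursor = 0  # end of the last span already copied into `pieces`
    for i in range(1, match.re.groups + 1):
        start, end = match.span(i)
        pieces.append(text[cursor:start])           # untouched text before the group
        pieces.append(f"<b>{text[start:end]}</b>")  # the highlighted character
        cursor = end
    pieces.append(text[cursor:])                    # remainder after the last group
    return "".join(pieces)


m = build_pattern("ooo").search("Leonardo DiCaprio")
if m:
    print(highlight_match(m))  # Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>
```

Because each group is located by position rather than by value, repeated letters such as 'ooo' are highlighted in order and an inserted tag is never searched again.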

A way using re.split:

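import re

test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]

def filter_and_highlight(strings, letters):
    # Builds e.g. (l)(.*?)(i)(.*?)(r); the gaps are captured too, so split() returns them
    pat = re.compile('(' + ')(.*?)('.join(map(re.escape, letters)) + ')', re.I)

    results = []

    for s in strings:
        # maxsplit=1 highlights only the first match of the whole pattern
        parts = pat.split(s, maxsplit=1)

        # No match: split() returns the string as a single part, so skip it
        if len(parts) == 1:
            continue

        res = ''
        for i, p in enumerate(parts):
            # Odd indexes hold the searched letters, even indexes the text around them
            if i & 1:
                p = '<b>' + p + '</b>'

            res += p

        results.append(res)

    return results

print(filter_and_highlight(test_string_list, 'lir'))
# ['<b>L</b>eonardo D<b>i</b>Cap<b>r</b>io', 'C<b>l</b>a<b>i</b><b>r</b>e Danes']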
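On the question of whether regex is overkill: the same filtering and highlighting can probably be done with str.find alone, moving a cursor forward after each hit. This is only a sketch, not a drop-in replacement; highlight_subsequence is a hypothetical name, and it assumes that lower() does not change the length of the string (true for ASCII, not for every Unicode character).

```python
from typing import Optional


def highlight_subsequence(text: str, chars: str) -> Optional[str]:
    """Bold the first in-order occurrence of each char, or return None if there is no match."""
    lowered = text.lower()  # assumption: same length as `text`
    pieces = []
    cursor = 0  # search resumes here, so earlier hits are never revisited
    for ch in chars.lower():
        idx = lowered.find(ch, cursor)
        if idx == -1:
            return None  # `chars` does not appear in order; filter this entry out
        pieces.append(text[cursor:idx])       # text between the previous hit and this one
        pieces.append(f"<b>{text[idx]}</b>")  # keep the original casing
        cursor = idx + 1
    pieces.append(text[cursor:])
    return "".join(pieces)


names = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]
for name in names:
    highlighted = highlight_subsequence(name, "ldi")
    if highlighted is not None:
        print(highlighted)  # <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio
```

Returning None for a non-match lets one function do both the filtering and the highlighting.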
