
Python Regex, how to substitute multiple occurrences with a single pattern?

I'm trying to make a fuzzy autocomplete suggestion box that highlights the searched characters with HTML <b></b> tags.

For example, if the user types 'ldi' and one of the suggestions is "Leonardo DiCaprio", then the desired outcome is "<b>L</b>eonar<b>d</b>o D<b>i</b>Caprio". The first occurrence of each character is highlighted in order of appearance.

What I'm doing right now is:

import re

def prototype_finding_chars_in_string():
    test_string_list = ["Leonardo DiCaprio", "Brad Pitt","Claire Danes","Tobey Maguire"]
    comp_string = "ldi" #chars to highlight
    regex = ".*?" + ".*?".join([f"({x})" for x in comp_string]) + ".*?" # results in .*?(l).*?(d).*?(i).*?
    regex_compiled = re.compile(regex, re.IGNORECASE)
    for x in test_string_list:
        re_search_result = re.search(regex_compiled, x) # correctly filters the test list to include only entries that features the search chars in order
        if re_search_result:
            print(f"char combination {comp_string} are in {x} result group: {re_search_result.groups()}")

This results in:

char combination ldi are in Leonardo DiCaprio result group: ('L', 'd', 'i')

Now I want to replace each occurrence in the result groups with <b>[whatever is in the result]</b>, and I'm not sure how to do it.

What I'm currently doing is looping over the result and using the built-in str.replace method to replace the occurrences:

def replace_with_bold(result_groups, original_string):
    output_string: str = original_string
    for result in result_groups:
        output_string = output_string.replace(result,f"<b>{result}</b>",1)
    
    return output_string

This results in:

Highlighted string: <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio

But I think looping over the results like this, when I already have the match groups, is wasteful. Furthermore, it's not even correct, because each replace call searches the string from the beginning. So for the input 'ooo' this is the result:

char combination ooo are in Leonardo DiCaprio result group: ('o', 'o', 'o')
Highlighted string: Le<b><b><b>o</b></b></b>nardo DiCaprio

When it should be Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>.

Is there a way to simplify this? Maybe regex is overkill here?

This should work:

for result in result_groups:
    output_string = re.sub(fr'(.*?(?!<b>))({result})((?!</b>).*)',
         r'\1<b>\2</b>\3',
         output_string,
         flags=re.IGNORECASE)

On each iteration, the first occurrence of result is replaced by <b>result</b>, provided it is not already enclosed in a tag. The ? makes .* lazy, which is what restricts the substitution to the first occurrence. The negative lookaheads (?!<b>) and (?!</b>) are meant to skip an occurrence that has already been wrapped. \1, \2 and \3 refer to the first, second and third groups. Additionally, the IGNORECASE flag makes the match case-insensitive.
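As an alternative to repeated substitution, the match object from the question's pattern already records where each group matched, so the tags can be inserted in a single pass over those positions. The following is only a sketch under a few assumptions: the helper name highlight_match is hypothetical, the leading and trailing .*? are dropped because re.search scans forward on its own, and re.escape is added so that regex metacharacters in the typed text are treated literally.

```python
import re

def highlight_match(text, chars):
    """Wrap the characters of `chars`, matched in order, in <b> tags.

    Returns None when `text` does not contain the characters in order.
    (Hypothetical helper name, not part of the original code.)
    """
    # One capturing group per typed character, lazily separated.
    pattern = ".*?".join(f"({re.escape(c)})" for c in chars)
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return None

    pieces = []
    last = 0  # end of the previously highlighted character
    for i in range(1, len(chars) + 1):
        start, end = match.span(i)  # position of group i in `text`
        pieces.append(text[last:start])
        pieces.append(f"<b>{text[start:end]}</b>")
        last = end
    pieces.append(text[last:])  # remainder after the last group
    return "".join(pieces)

print(highlight_match("Leonardo DiCaprio", "ldi"))
print(highlight_match("Leonardo DiCaprio", "ooo"))
```

Because each tag is placed at the position reported by match.span, repeated characters such as 'ooo' are not wrapped twice, and the search and the highlighting share one regex evaluation.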

A way using re.split:

import re

test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]

def filter_and_highlight(strings, letters):
    
    pat = re.compile( '(' + (')(.*?)('.join(letters)) + ')', re.I)
    
    results = []
    
    for s in strings:
        parts = pat.split(s)
        
        if len(parts) == 1: continue
        
        res = ''
        for i, p in enumerate(parts):
            if i & 1:
                p = '<b>' + p + '</b>'
                
            res += p
            
        results.append(res)
        
    return results

filter_and_highlight(test_string_list, 'lir')
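On the question of whether regex is overkill: the same filtering and highlighting can probably be done with a plain left-to-right scan. This is a minimal sketch, not a drop-in replacement. The function name highlight_subsequence is hypothetical, and case-insensitivity is approximated with str.lower, which may differ from re.IGNORECASE for some non-ASCII characters.

```python
def highlight_subsequence(text, chars):
    """Wrap the first in-order occurrence of each character in <b> tags.

    Returns None when `text` does not contain `chars` as a subsequence.
    (Hypothetical helper name, not part of the original answers.)
    """
    wanted = chars.lower()
    pieces = []
    k = 0  # index of the next character still to be found
    for ch in text:
        if k < len(wanted) and ch.lower() == wanted[k]:
            pieces.append(f"<b>{ch}</b>")
            k += 1
        else:
            pieces.append(ch)
    # Only a full match counts as a suggestion.
    return "".join(pieces) if k == len(wanted) else None

names = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]
results = [h for h in (highlight_subsequence(n, "lir") for n in names) if h is not None]
print(results)
```

The scan visits each character once and never re-reads text that already contains tags, so the nested-tag problem from the str.replace loop does not arise.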


 