Specifying word boundaries for multiple string replacement with regex?

I'm trying to mask city names in a list of texts using 'PAddress' tags. To do this, I borrowed thejonny's solution (linked in a comment in the code below) for performing multiple regex substitutions using a dictionary with regex expressions as keys. In my implementation, the cities are the keys and the values are tags that match the exact format of the keys (this is important because the format must be preserved down the line). E.g., {East-Barrington: PAddress-PAddress}, so East-Barrington would be replaced by PAddress-PAddress: one tag per word, with punctuation and spacing preserved. Below is my code; sub_mult_regex() is the helper function called by mask_multiword_cities().

import difflib
import re


def sub_mult_regex(text, keys, tag_type):
    '''
    Replaces/masks multiple words at once
    Parameters:
        Text: TIU note
        Keys: a list of words to be replaced by the regex
        Tag_type: string you want the words to be replaced with
    Creates a replacement dictionary of keys and values 
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St., PAddress PAddress PAddress.,}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)

    add_vals = []
    for val in keys:
        add_vals.append(re.sub(r'\w{1,100}', tag_type, val)) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags

    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))

    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("("+key+")" for key in add_dict), re.IGNORECASE)

    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text) # text_sub is masked 
    else:
        text_sub = text # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise

    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)

    case_a = text
    case_b = text_sub

    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]

    return text_sub, diff_list 

 

def mask_multiword_cities(text_string):
    multi_word_cities = list(set([city for city in us_cities_all if len(city.split(' ')) > 1 and len(city) > 3 and "Mc" not in city and "State" not in city and city != 'Mary D']))
    return sub_mult_regex(text_string, multi_word_cities, "PAddress")

The problem is that the keys in the regex dictionary don't have word boundaries specified. Only exact matches should be tagged (case-insensitively), but a phrase like 'around others' gets tagged because the city 'Round O' is technically a substring of it. Take this example text, run through the mask_multiword_cities function:

add_string = "The cities are Round O , NJ and around others"

mask_multiword_cities(add_string)

#(output): ('The cities are PAddress PAddress NJ , and aPAddress PAddressthers', [' Round', ' O', ' around', ' others'])

The output should only be ('The cities are PAddress PAddress NJ, and around others', [' Round', ' O']). I've tried converting each key to a regex expression like r"\b(?=\w)key\b(?!\w)" at various points in the sub_mult_regex function (lines 26 and 37), but that didn't work as expected.

For testing, assume that: us_cities_all = ['Great Barrington', 'Round O', 'East Orange'].

Also, if anyone can help make this run faster or more efficiently, that would be great. Right now it takes about 30 seconds to run on a 1000-word note, likely because us_cities_all contains 5,000 cities. Let me know if it would be more helpful to post the cities list directly; I wasn't sure how to do so.
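
One way to address both the boundary problem and the run time is to compile all city names into a single alternation once and let a replacement callback build the tags. The following is a minimal sketch, not the approach used in the question: build_city_pattern and mask_cities are hypothetical names, the keys are assumed to be literal city names (so they are passed through re.escape), and the speed-up is an expectation that has not been timed against the 5,000-city list.

```python
import re


def build_city_pattern(cities):
    """Compile one case-insensitive pattern matching any city as a whole phrase."""
    # Longest names first, so a longer city name wins over a shorter prefix of it.
    ordered = sorted(set(cities), key=len, reverse=True)
    if not ordered:
        # Assumption: with no cities, use a pattern that can never match.
        return re.compile(r"(?!)")
    alternation = "|".join(re.escape(city) for city in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def mask_cities(text, pattern, tag="PAddress"):
    """Return (masked_text, masked_terms), with one tag per word of each match."""
    masked_terms = []

    def replace(match):
        masked_terms.append(match.group(0))
        # Swap every run of word characters for the tag; spaces, hyphens
        # and other punctuation inside the match are left untouched.
        return re.sub(r"\w+", tag, match.group(0))

    return pattern.sub(replace, text), masked_terms


us_cities_all = ["Great Barrington", "Round O", "East Orange"]
city_pattern = build_city_pattern(us_cities_all)  # compile once, reuse per note
print(mask_cities("The cities are Round O , NJ and around others", city_pattern))
# ('The cities are PAddress PAddress , NJ and around others', ['Round O'])
```

Because the pattern is compiled once, it can be reused for every note. The callback also records the matched text directly, which could stand in for the difflib comparison if only the masked terms are needed.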

You can partially extract the words and combine them later. I have added example code based on your cases. Note that it will fail if your add_string has no spaces between words.

Example code:

import re


# replace the string
def replacer(string, noise_list):
    for v in noise_list:
        string = string.replace(v, "PAddress")
    return string


def multi_mask(multi_word_cities, add_string):
    for city in multi_word_cities:
        if city in add_string:
            city_data = city.split()
            add_string_split = add_string.split()
            matched_city_data = [i for i in add_string_split if any((j == i) for j in city_data)]
            city_index = add_string_split.index(matched_city_data[1])
            new_string = ' '.join(add_string_split[:city_index + 1])
            replaced_data = replacer(new_string, matched_city_data)
            capital_string = ''.join(re.findall(r'[A-Z]{2}', add_string))
            index_of_and = add_string_split.index("and")
            text_after_and = ' '.join(add_string_split[index_of_and:])
            return replaced_data + ' ' + capital_string, text_after_and, matched_city_data


us_cities_all = ['Great Barrington', 'Round O', 'East Orange']
multi_word_cities = list(set([city for city in us_cities_all if len(city.split(' ')) > 1 and len(
    city) > 3 and "Mc" not in city and "State" not in city and city != 'Mary D']))
add_string = "The hospital is in East Orange and around o"

print(multi_mask(multi_word_cities, add_string))

>>> ('The hospital is in PAddress PAddress ', 'and around o', ['East', 'Orange'])

I figured out a word-boundary-based solution that handles multiple cities, in case anyone finds it helpful in a similar situation:

def sub_mult_regex(text, keys, tag_type, city):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
        city: bool, True if replacing cities, False if replacing anything else

    Creates a replacement dictionary of keys and values 
    (values are the length of the key, preserving formatting).

    Eg., {68 Oak St, PAddress PAddress PAddress}

    Returns text with relevant text masked
    '''

    # Creating a list of values to correspond with keys (see key:value example in docstring)

    if city:
        # If we're masking a city, handle word boundaries
        # This step of only including keys if they show up in the text speeds the code up by a lot, since it's not cross-referencing against thousands of cities, only the ones present
        keys = [r"\b"+key+r"\b" for key in keys if key in text or key.upper() in text] # add word boundaries for each key in list
        add_vals = []
        for val in keys:
            # Create dictionary of city word:PAddress by splitting the city on the '\\b' char that remains and then adding one tag per word
            # Ex: '\\bDeer Island\\b' --> split('\\b') --> ['', 'Deer Island', ''] --> ''.join --> (key) Deer Island : (value) PAddress PAddress
            add_vals.append(re.sub(r'\w{1,100}', tag_type, ''.join(val.split('\\b')))) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags
        add_vals = [re.sub(r'\\b', "", val) for val in add_vals]

    elif not city:
        # If we're not masking a city, we don't do the word boundary step
        add_vals = []
        for val in keys:
            add_vals.append(re.sub(r'\w{1,100}', tag_type, val)) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags

    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    print("add_dict:", add_dict)

    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("("+key+")" for key in add_dict), re.IGNORECASE)

    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys

    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text) # text_sub is masked text

    else:
        text_sub = text # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise

    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)

    case_a = text
    case_b = text_sub

    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]

 
    return text_sub, diff_list 
# sample call:
add_string = 'The cities are Round O NJ, around others and East Orange'
mask_multiword_cities(add_string) # this function remained the same 

# output:
# add_dict: {'\\bEast Orange\\b': 'PAddress PAddress', '\\bRound O\\b': 'PAddress PAddress'}
# ('The cities are PAddress PAddress NJ, around others and PAddress PAddress', [' Round', ' O', ' East', ' Orange'])
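
One caveat with wrapping keys in \b: it only matches between a word character and a non-word character, so a key ending in punctuation (such as the docstring's '68 Oak St.') cannot match when it is followed by a space, and an unescaped '.' in a key matches any character. Below is a hedged sketch of an alternative that uses re.escape plus lookarounds; bounded is a hypothetical helper name, not part of the code above.

```python
import re


def bounded(key):
    """Wrap a literal key so it only matches when not glued to word characters."""
    # (?<!\w) and (?!\w) behave like \b for keys that start and end with a
    # letter or digit, and still work when the key ends in punctuation.
    return r"(?<!\w)" + re.escape(key) + r"(?!\w)"


sample = "Lives at 68 Oak St. now"

# \b after the final '.' requires a word character next, so nothing is found.
print(re.findall(r"\b" + re.escape("68 Oak St.") + r"\b", sample))  # []

# The lookaround version matches the address as written.
print(re.findall(bounded("68 Oak St."), sample))  # ['68 Oak St.']

# Substrings inside longer words are still rejected.
print(re.findall(bounded("Round O"), "around others", flags=re.IGNORECASE))  # []
```

For plain city names such as 'Round O' the two forms should behave the same; the difference only shows up for keys that begin or end with a non-word character.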
