
How to replace compound words in a string using a dictionary?

I have a dictionary whose key:value pairs correspond to compound words and the expressions I want to replace them with in a text. For example, let's say:

terms_dict = {'digi conso': 'digi conso', 'digi': 'digi conso', 'digiconso': 'digi conso', '3xcb': '3xcb', '3x cb': '3xcb', 'legal entity identifier': 'legal entity identifier'}

My goal is to create a function replace_terms(text, dict) that takes a text and a dictionary like this one as parameters, and returns the text after replacing the compound words.

For instance, this script:

test_text = "i want a digi conso loan for digiconso" 

print(replace_terms(test_text, terms_dict))

Should return:

"i want a digi conso loan for digi conso"

I have tried using .replace(), but for some reason it doesn't work properly, probably because the terms to replace are composed of multiple words.

I also tried this:

import re

def replace_terms(text, terms_dict):
    if len(terms_dict) > 0:
        words_in = [k for k in terms_dict.keys() if k in text]  # ex: words_in = [digi conso, digi, digiconso]
        if len(words_in) > 0:
            for w in words_in:
                pattern = r"\b" + w + r"\b"
                text = re.sub(pattern, terms_dict[w], text)

    return text

But when applied to my text, this function returns "i want a digi conso conso loan for digi conso": the word conso gets doubled, and I can see why (the words_in list is built by going through the dictionary keys, and the text is not altered when a key is appended to the list, so the shorter key digi is still applied afterwards to the digi conso that is already in the text).

Is there an efficient way to do this?

Thanks a lot!
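To make the doubling concrete, here is a small trace of the same substitutions applied one key at a time. It is only an illustrative sketch: the `steps` list and the `history` variable are names introduced here, and `re.escape` is added as a precaution that the original attempt does not use.

```python
import re

text = "i want a digi conso loan for digiconso"

# The three keys found in the text, in dictionary order, with their values.
steps = [
    ("digi conso", "digi conso"),
    ("digi", "digi conso"),
    ("digiconso", "digi conso"),
]

history = []  # text after each substitution
for key, value in steps:
    # Each pass sees the output of the previous pass, not the original text.
    text = re.sub(r"\b" + re.escape(key) + r"\b", value, text)
    history.append(text)
    print(f"{key!r:>12} -> {text}")
```

The second pass is where "conso" is duplicated: the key digi matches the first word of digi conso and is expanded again.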

This should do it.


terms_dict = {'digiconso': 'digi conso', '3xcb': '3xcb', '3x cb': '3xcb', 'legal entity identifier': 'legal entity identifier'}
test_text = "i want a digi conso loan for digiconso"

def replace_terms(txt, dct):
    # Turn the dict into a tuple of (key, value) pairs.
    dct = tuple(dct.items())
    for x, y in dct:
        # Replace only the first occurrence of each key.
        txt = txt.replace(x, y, 1)
    return txt

print(replace_terms(test_text, terms_dict))

First I take the dict's key/value pairs and put them in an easier form (a tuple). Then I iterate over them and replace!

Output:

i want a digi conso loan for digi conso

You had too many extra replacement entries that you did not need. I also made it replace only the first occurrence of each key, but you can change that.

A rather quick and wonky way of doing this:
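As a quick illustration of the "only replace 1" remark, the third argument of str.replace limits how many occurrences are replaced. The sample strings below are made up for this sketch; the last line also shows a caveat of plain substring replacement, which ignores word boundaries.

```python
text = "digiconso or digiconso"

# count=1: only the first occurrence is replaced.
first_only = text.replace("digiconso", "digi conso", 1)

# No count: every occurrence is replaced.
every = text.replace("digiconso", "digi conso")

# str.replace works on substrings, so it also rewrites part of a longer word.
inside_word = "digiconsole".replace("digiconso", "digi conso")

print(first_only)
print(every)
print(inside_word)
```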

def replace_terms(text, terms):
    replacement_list = []
    check = True
    for term in terms:
        if term in text:
            for r in replacement_list:
                if r[0] == text.index(term):
                    if len(term) > len(r[1]):
                        replacement_list.remove(r)
                    else:
                        check = False
            if check:
                replacement_list.append([text.index(term), term])
            else:
                check = True
    for r in replacement_list:
        text = text.replace(r[1], terms[r[1]])
    return text

Usage:

terms_dict = {
    "digi conso": "digi conso",
    "digi": "digi conso",
    "digiconso": "digi conso",
    "3xcb": "3xcb",
    "3x cb": "3xcb",
    "legal entity identifier": "legal entity identifier"
}

test_text = "i want a digi conso loan for digiconso"

print(replace_terms(test_text, terms_dict))

Result:

i want a digi conso loan for digi conso
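Another option, not taken from either answer above, is to do all replacements in a single regex pass so that replaced text is never scanned again. This is a minimal sketch under a few assumptions: the function name replace_terms_single_pass is made up here, keys are matched on word boundaries, and longer keys are tried first so that digi conso wins over digi.

```python
import re

terms_dict = {
    "digi conso": "digi conso",
    "digi": "digi conso",
    "digiconso": "digi conso",
    "3xcb": "3xcb",
    "3x cb": "3xcb",
    "legal entity identifier": "legal entity identifier",
}

def replace_terms_single_pass(text, terms):
    """Replace every key of `terms` found in `text`, scanning the text once."""
    if not terms:
        return text
    # Longest keys first: alternation picks the first alternative that matches.
    keys = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # Look up the replacement for whichever key matched.
    return pattern.sub(lambda m: terms[m.group(0)], text)

print(replace_terms_single_pass("i want a digi conso loan for digiconso", terms_dict))
```

Because re.sub moves past each match before looking for the next one, the output of one replacement cannot be matched by another key, which avoids the doubled "conso".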
