
Making multiple search and replace more precise in Python for lemmatizer

I am trying to make my own lemmatizer for Spanish in Python 2.7 using a lemmatization dictionary.

I would like to replace all of the words in a certain text with their lemma form. This is the code that I have been working on so far.

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text


my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower= my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt

Here is an example dictionary file containing the lemmatized forms used to replace the words in the input (my_text_lower). The example dictionary is a tab-separated two-column file in which column 1 holds the values (the lemmas) and column 2 holds the keys (the inflected forms) to match.

ExampleDictionary

flojo   floja
flojo   flojas
flojo   flojos
cargamento  cargamentos
cargante    cargantes
decepción   decepciones
decepcionante   decepcionantes
decentar    decenté
decentar    decentéis
decentar    decentemos
decentar    decentó

My desired output is as follows:

flojo y cargante. decepcionante. decentar decentar

Using these inputs (and the example phrase listed in my_text within the code), my actual output currently is:

felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar

Currently, I can't seem to understand what is going wrong with the code.

It seems that it is replacing letters or chunks of each word instead of recognizing the whole word, finding it in the lemma dictionary, and replacing it.

The garbled result above is what I get when I use the entire dictionary (more than 50,000 entries). The problem does not happen with my small example dictionary, only with the complete one, which makes me think that perhaps it is "double replacing" at some point.

Is there a Pythonic technique that I am missing and can incorporate into this code to make my search and replace more precise, so that it identifies full words for replacement rather than chunks and does not make any double replacements?

I see two problems with your code:

  • it will also replace words if they appear as part of a bigger word
  • by replacing words one after the other, you could replace (parts of) words that have already been replaced
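
Both failure modes can be reproduced with a tiny two-entry dictionary (the entries "lo" and "it" here are hypothetical, chosen only to trigger the problem):

```python
# Hypothetical entries chosen to trigger both failure modes.
text = "flojo y cargantes"
dic = {"lo": "litro", "it": "x"}
for old, new in dic.items():
    # str.replace matches substrings anywhere, and a later pass can
    # rewrite text that an earlier pass already inserted.
    text = text.replace(old, new)
print(text)  # "flxrojo y cargantes": the word "flojo" was mangled twice
```

The first pass turns "flojo" into "flitrojo" (a substring hit), and the second pass then corrupts the freshly inserted "litro" as well, which is exactly the kind of compounding damage visible in the "felitrojo" output above.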

Instead of that loop, I suggest using re.sub with word boundaries (\b) to make sure that you replace complete words only. This way, you can also pass a callable as the replacement function.

import re

def replace_all(text, dic):
    # Look up every whole word; fall back to the word itself if it has no lemma.
    return re.sub(r"\b\w+\b", lambda m: dic.get(m.group(), m.group()), text)
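
For example (a small illustration with a one-entry dictionary, not part of the original answer), the word boundary stops "cargantes" from matching inside other words, each word is replaced at most once, and the punctuation is left untouched:

```python
import re

def replace_all(text, dic):
    # Each whole word is looked up exactly once; unknown words pass through.
    return re.sub(r"\b\w+\b", lambda m: dic.get(m.group(), m.group()), text)

print(replace_all("flojo y cargantes.", {"cargantes": "cargante"}))
# flojo y cargante.
```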

Because you use text.replace, you can still match a sub-string, and text that has already been replaced can get processed again. It's better to process the input one word at a time and build the output string word by word.

I've switched your key and value the other way around (because you want to look up the inflected form on the right and substitute the lemma on the left), and I mainly changed replace_all:

import re

def replace_all(text, dic):
    # Tokenize into words (keeping basic punctuation as separate tokens).
    tokens = re.findall(r"[\w']+|[.,!?;]", text)
    # Look each token up once; keep it unchanged if it has no entry.
    return " ".join(dic.get(word, word) for word in tokens)

my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower= my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        val, key = line.split()  # lemma first, inflected form second
        lemmatize_word_dict[key] = val

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
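
Putting the pieces together, here is a self-contained Python 3 sketch of the whole pipeline (the dictionary lines are inlined instead of read from the ExampleDictionary file to keep it runnable, and print is a function in Python 3):

```python
import re

# In practice these tab-separated lines would be read from the
# ExampleDictionary file; they are inlined here for a runnable sketch.
DICT_LINES = """flojo\tflojas
cargante\tcargantes
decepcionante\tdecepcionantes
decentar\tdecenté
decentar\tdecentó"""

lemmas = {}
for line in DICT_LINES.splitlines():
    lemma, form = line.split("\t")
    lemmas[form] = lemma  # key: inflected form, value: lemma

def lemmatize(text):
    # Whole words only; any word without a dictionary entry is kept as-is.
    return re.sub(r"\b\w+\b", lambda m: lemmas.get(m.group(), m.group()), text)

print(lemmatize("flojo y cargantes. decepcionantes. decenté decentó"))
# flojo y cargante. decepcionante. decentar decentar
```

Note that in Python 3, \w matches accented letters such as "é" by default; under Python 2.7 you would need the re.UNICODE flag and unicode strings for the same behavior.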
