I am trying to make my own lemmatizer for Spanish in Python 2.7
using a lemmatization dictionary.
I would like to replace all of the words in a certain text with their lemma forms. This is the code I have been working on so far.
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
Here is an example dictionary file which contains the lemmatized forms used to replace the words in the input (my_text_lower). The example dictionary is a tab-separated two-column file in which column 1 represents the values and column 2 represents the keys to match.
ExampleDictionary
flojo floja
flojo flojas
flojo flojos
cargamento cargamentos
cargante cargantes
decepción decepciones
decepcionante decepcionantes
decentar decenté
decentar decentéis
decentar decentemos
decentar decentó
My desired output is as follows:
flojo y cargante. decepcionante. decentar decentar
Using these inputs (and the example phrase, as listed in my_text within the code), my actual output currently is:
felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar
Currently, I can't seem to understand what is going wrong with the code.
It seems that it is replacing letters or chunks of each word, instead of recognizing the whole word, finding it in the lemma dictionary, and then replacing that instead.
For instance, this is the result I get when I use the entire dictionary (more than 50,000 entries). This problem does not happen with my small example dictionary, only when I use the complete dictionary, which makes me think that perhaps it is double "replacing" at some point.
Is there a Pythonic technique that I am missing and can incorporate into this code to make my search-and-replace function more precise, so that it identifies full words for replacement rather than chunks and/or does not make any double replacements?
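A minimal reproduction of the suspected problem, using two invented dictionary entries (these are not real lemma pairs, just a hypothetical illustration of one replacement's output being re-matched by a later key):

```python
# Hypothetical entries: "ante" is a substring of the replacement "cargante",
# so the second pass mangles the output of the first.
text = "cargantes"
dic = {"cargantes": "cargante", "ante": "anterior"}
for key, value in dic.items():
    text = text.replace(key, value)
print(text)  # carganterior -- not the intended "cargante"
```

With 50,000 entries, collisions like this become almost inevitable, which explains why the small example dictionary works but the full one does not.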
I see two problems with your code.
Instead of that loop, I suggest using re.sub
with word boundaries \b
to make sure that you replace complete words only. This way, you can also pass a callable as the replacement function.
import re

def replace_all(text, dic):
    return re.sub(r"\b\w+\b", lambda m: dic.get(m.group(), m.group()), text)
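A quick check of this approach on part of the sample phrase, with a small hand-built dictionary (words already in the dictionary's key form):

```python
import re

def replace_all(text, dic):
    # \b\w+\b matches whole words only, so each word is looked up exactly once
    return re.sub(r"\b\w+\b", lambda m: dic.get(m.group(), m.group()), text)

dic = {"cargantes": "cargante", "decepcionantes": "decepcionante"}
print(replace_all("flojo y cargantes. decepcionantes.", dic))
# flojo y cargante. decepcionante.
```

Because each match is passed to the callable once and the result is never re-scanned, no double replacement can occur. Note that in Python 2 you may need the re.UNICODE flag (and unicode strings) for \w to match accented characters such as é.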
Because you use text.replace, there's a chance that you'll still be matching a sub-string, and the text will get processed again. It's better to process one input word at a time and build the output string word by word.
I've swapped your key and value the other way around (because you want to look up the word on the right and find the lemma on the left), and I mainly changed replace_all:
import re

def replace_all(text, dic):
    result = ""
    input = re.findall(r"[\w']+|[.,!?;]", text)
    for word in input:
        changed = dic.get(word, word)
        result = result + " " + changed
    return result
my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        kv = line.split()
        lemmatize_word_dict[kv[1]] = kv[0]

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
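For reference, here is the same logic with the dictionary built inline instead of read from the file. One caveat: the tokenizer emits punctuation as separate tokens and every token is joined with a leading space, so the rebuilt string starts with a space and has spaces before the periods, which differs slightly from the desired output:

```python
import re

def replace_all(text, dic):
    result = ""
    # split the text into word tokens and punctuation tokens
    words = re.findall(r"[\w']+|[.,!?;]", text)
    for word in words:
        # each whole token is looked up once; unknown tokens pass through unchanged
        result = result + " " + dic.get(word, word)
    return result

dic = {"cargantes": "cargante", "decepcionantes": "decepcionante",
       "decenté": "decentar", "decentó": "decentar"}
print(replace_all("flojo y cargantes. decepcionantes. decenté decentó", dic))
# " flojo y cargante . decepcionante . decentar decentar"
```

A call to .strip() and some punctuation re-attachment would be needed to reproduce the exact desired output string.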