
How to solve Spanish lemmatization problems with SpaCy?

When trying to lemmatize a CSV with more than 60,000 Spanish words, spaCy does not lemmatize certain words correctly. I understand that the model is not 100% accurate; however, I have not found any other solution, since NLTK does not provide a Spanish lemmatizer.

A friend tried asking this question on the Spanish Stack Overflow, but that community is quite small compared with this one, and we got no answers.

code:

import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([word.lemma_ for word in doc])

df['column'] = df['column'].apply(lemmatizer)

I lemmatized certain words that looked wrong to show that spaCy is not handling them correctly:

text = 'personas, ideas, cosas' 
# translation: persons, ideas, things

print(lemmatizer(text))
# Current output:
# personar , ideo , coser
# translation: personify, ideo, sew

# The expected output should be:
# persona, idea, cosa
# translation: person, idea, thing

Unlike the English lemmatizer, spaCy's Spanish lemmatizer does not use PoS information at all. It relies on a lookup list of inflected forms and lemmas (e.g., ideo → idear, ideas → idear, idea → idear, ideamos → idear, etc.). It simply outputs the first match in the list, regardless of its PoS.
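To make the failure mode concrete, here is a minimal sketch of how a lookup-only lemmatizer behaves. The table entries below are illustrative, not spaCy's actual data; the point is that the stored lemma wins unconditionally, with no PoS check:

```python
# Hypothetical lookup table: the verb reading happens to be stored,
# so it is returned even when the token is clearly a noun.
LOOKUP = {
    "personas": "personar",
    "ideas": "idear",
    "cosas": "coser",
}

def lookup_lemmatize(token):
    # No PoS information is consulted: return the stored lemma,
    # or the token itself when there is no entry.
    return LOOKUP.get(token, token)

print([lookup_lemmatize(t) for t in ["personas", "ideas", "cosas"]])
# → ['personar', 'idear', 'coser']
```

This reproduces exactly the wrong output shown above, which is why a PoS-aware lemmatizer is needed.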

I actually developed spaCy's new rule-based lemmatizer for Spanish, which takes PoS and morphological information (such as tense, gender, number) into account. These fine-grained rules make it a lot more accurate than the current lookup lemmatizer. It will be released soon!

Meanwhile, you could use Stanford CoreNLP or FreeLing.

One option is to make your own lemmatizer.

This might sound frightening, but fear not! It is actually quite simple to build one.

I've recently made a tutorial on how to make a lemmatizer, the link is here:

https://medium.com/analytics-vidhya/how-to-build-a-lemmatizer-7aeff7a1208c

As a summary, you'd have to:

  • Have a POS Tagger (you can use spaCy tagger) to tag input words.
  • Get a corpus of words and their lemmas - here, I suggest you download a Universal Dependencies Corpus for Spanish - just follow the steps in the tutorial mentioned above.
  • Create a lemma dict from the words extracted in the corpus.
  • Save the dict and make a wrapper function that receives both the word and its PoS.
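The corpus-extraction steps above can be sketched as follows. This assumes a local Universal Dependencies file in CoNLL-U format (the path is up to you); in CoNLL-U, each token line is tab-separated with FORM, LEMMA and UPOS in columns 2–4:

```python
from collections import defaultdict

def build_lemma_dict(conllu_path):
    # Build {word: {pos: lemma}} from a UD CoNLL-U file.
    lemma_dict = defaultdict(dict)
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip comments and blank sentence separators
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # Skip multiword-token ranges (1-2) and empty nodes (1.1)
            if len(cols) < 4 or "-" in cols[0] or "." in cols[0]:
                continue
            form, lemma, upos = cols[1].lower(), cols[2].lower(), cols[3]
            # Keep the first lemma seen for each (word, PoS) pair
            lemma_dict[form].setdefault(upos, lemma)
    return dict(lemma_dict)
```

Run once over the Spanish UD treebank, then pickle or JSON-dump the resulting dict so the wrapper below can load it.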

In code, it'd look like this:

def lemmatize(word, pos):
    # Look up the (word, PoS) pair; fall back to the word itself.
    if word in lemma_dict and pos in lemma_dict[word]:
        return lemma_dict[word][pos]
    return word

Simple, right?
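As a quick self-contained check, the wrapper can be exercised with a toy dictionary (the entries here are illustrative, not from a real corpus) to show how PoS disambiguates the lemma:

```python
# Toy {word: {pos: lemma}} dictionary (illustrative entries only).
lemma_dict = {
    "ideas": {"NOUN": "idea", "VERB": "idear"},
    "cosas": {"NOUN": "cosa"},
}

def lemmatize(word, pos):
    # Same logic as the wrapper above, written with .get chaining.
    return lemma_dict.get(word, {}).get(pos, word)

print(lemmatize("ideas", "NOUN"))  # → idea
print(lemmatize("ideas", "VERB"))  # → idear
print(lemmatize("gatos", "NOUN"))  # → gatos (unknown word passes through)
```

Note how the same surface form "ideas" maps to different lemmas depending on the tag — exactly what the lookup-only approach cannot do.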

In fact, simple lemmatization doesn't require as much processing as one would think. The hard part lies in PoS tagging, but you get that for free. Either way, if you want to do the tagging yourself, you can see this other tutorial I made:

https://medium.com/analytics-vidhya/part-of-speech-tagging-what-when-why-and-how-9d250e634df6

Hope you get it solved.

You can use spacy-stanza. It exposes Stanza's models through spaCy's API:

import stanza
from spacy_stanza import StanzaLanguage

# stanza.download("es")  # run once to fetch the Spanish models

text = "personas, ideas, cosas"

snlp = stanza.Pipeline(lang="es")
nlp = StanzaLanguage(snlp)
doc = nlp(text)
for token in doc:
    print(token.lemma_)

Maybe you can use FreeLing. Among many other functionalities, this library offers lemmatization in Spanish, Catalan, Basque, Italian and other languages.

In my experience, lemmatization in Spanish and Catalan is quite accurate, and although FreeLing is written in C++, it has APIs for both Python and Java.
