
Determining if two words are derived from the same root in Python

I'd like to write a function same_base(word1, word2) that returns True when word1 and word2 are two English words derived from the same root word. I realize that words can have multiple senses; I want the algorithm to be overzealous, returning True whenever it is possible to view the words as originating from the same place. Some false positives are OK; false negatives are not.

Typically, stemming and lemmatization would be used for this. Here's what I've tried:

  • Check if the words stem to the same thing, using, for instance, the Porter Stemmer. This doesn't catch sung and sing, dig and dug, or medication and medicine.
  • Check if the words lemmatize to the same thing. It's unclear what arguments to pass to the lemmatizer (i.e., for part of speech). The WordNet lemmatizer, at least, seems to be too conservative.
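
One way to make either approach more "overzealous" is to collect every normalized form that any tool produces for a word and return True as soon as the two sets overlap. The sketch below is only an illustration of that idea: crude_stem and irregular_lookup are hypothetical stand-ins (not the Porter Stemmer or the WordNet lemmatizer), and the suffix list and irregular-form table are hand-picked for the examples above. In practice the normalizers could be a real stemmer plus a lemmatizer called once per part-of-speech tag, which also sidesteps the question of which tag to pass.

```python
from typing import Callable, Iterable, Set

# Hypothetical suffix list, chosen only to illustrate the idea.
SUFFIXES = ("ations", "ation", "ings", "ing", "ines", "ine", "ed", "es", "s")

# Tiny hand-written table of irregular forms (illustrative, far from complete).
IRREGULAR = {"sung": "sing", "sang": "sing", "dug": "dig"}


def crude_stem(word: str) -> str:
    """Stand-in for a real stemmer: strip the longest matching suffix,
    but only if at least three letters remain."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def irregular_lookup(word: str) -> str:
    """Stand-in for a lemmatizer's exception list."""
    return IRREGULAR.get(word, word)


def candidate_bases(word: str, normalizers: Iterable[Callable[[str], str]]) -> Set[str]:
    """Return the word itself plus every form any normalizer maps it to."""
    word = word.lower()
    return {word} | {normalize(word) for normalize in normalizers}


def same_base(word1: str, word2: str,
              normalizers=(crude_stem, irregular_lookup)) -> bool:
    """True if the two words share at least one candidate base form."""
    bases1 = candidate_bases(word1, normalizers)
    bases2 = candidate_bases(word2, normalizers)
    return not bases1.isdisjoint(bases2)
```

With these toy normalizers, same_base("sung", "sing"), same_base("dig", "dug") and same_base("medication", "medicine") all return True, while same_base("sing", "dig") returns False. Adding more normalizers can only add candidates, so it trades more false positives for fewer false negatives.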

Does such a tool exist? Do I just need an extremely aggressive stemmer / lemmatizer combo, and if so, where would I find one?

The general task, as you've described it, cannot be solved by simple textual analysis of the input characters. English has no consistent rules for how words change as they evolve. An excellent lemmatiser will solve the straightforward cases for you: those that can be recognized by applying transformations common within a given part of speech (such as irregular verb forms).

However, to eliminate false negatives, you need complete coverage of each word's origin, and complete coverage requires etymology, especially when the root word isn't in the English language or doesn't visibly appear in the shortened word itself.

For instance, what software tool could tell you that dis and speculum have the same root (specere), but that species does not? How would you tell that gentle, gentile, genteel, and jaunty have the same root? You'll need the etymology to get 100% of the actual connections.
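
If etymological data were available, the comparison itself would be a simple table lookup. The following is a minimal sketch under that assumption: ROOTS is a hypothetical, hand-entered table (a real one would have to come from an etymological dictionary), the name same_root is invented for illustration, and the root "gentilis" for the second group is supplied here rather than taken from the text above.

```python
from typing import Dict, Optional, Set

# Hypothetical, hand-entered etymology table (illustrative only).
# "specere" is the root named in the text; "gentilis" is an assumption
# added for this sketch.
ROOTS: Dict[str, Set[str]] = {
    "dis": {"specere"},
    "speculum": {"specere"},
    "gentle": {"gentilis"},
    "gentile": {"gentilis"},
    "genteel": {"gentilis"},
    "jaunty": {"gentilis"},
}


def same_root(word1: str, word2: str,
              roots: Dict[str, Set[str]] = ROOTS) -> Optional[bool]:
    """Return True if the words share a recorded root, False if both are
    known and share none, and None if either word is missing from the table."""
    roots1 = roots.get(word1.lower())
    roots2 = roots.get(word2.lower())
    if roots1 is None or roots2 is None:
        return None  # unknown word: the caller decides how to treat it
    return bool(roots1 & roots2)
```

Returning None for unknown words keeps "no shared root" separate from "no data". A caller who cannot tolerate false negatives could treat None as a signal to fall back on a stemmer or lemmatizer, or simply count it as a possible match.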
