Determining if two words are derived from the same root in Python

Original post: 2017-12-29 18:01:19 | Tags: python, nlp, nltk, wordnet

I'd like to write a function same_base(word1, word2) that returns True when word1 and word2 are two English words derived from the same root word. I realize that words can have multiple senses; I want the algorithm to be overzealous, returning True whenever it is possible to view the words as originating from the same place. Some false positives are OK; false negatives are not.

Typically, stemming and lemmatization would be used for this. Here's what I've tried:

Check if the words stem to the same thing, using, for instance, the Porter Stemmer. This doesn't catch sung and sing, dig and dug, or medication and medicine.
Check if the words lemmatize to the same thing. It's unclear what arguments to pass to the lemmatizer (i.e., for part of speech). The WordNet lemmatizer, at least, seems to be too conservative.
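The two attempts above can be combined into one "overzealous" check: collect every form that any normalizer produces for each word, and return True if the two sets overlap. The following is only a minimal sketch of that idea. The helper `crude_suffix_strip` is a hypothetical toy stand-in, not a real stemmer; in practice one might pass in a real stemmer and a lemmatizer tried once per part-of-speech tag instead.

```python
from typing import Callable, Iterable, Sequence, Set

# A normalizer maps a word to zero or more candidate base forms.
Normalizer = Callable[[str], Iterable[str]]


def crude_suffix_strip(word: str) -> Set[str]:
    """Hypothetical toy stemmer: strips a few common inflectional suffixes."""
    forms = {word}
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            forms.add(word[: -len(suffix)])
    return forms


def same_base(word1: str, word2: str, normalizers: Sequence[Normalizer]) -> bool:
    """Return True if any normalizer output for word1 matches any for word2."""

    def candidates(word: str) -> Set[str]:
        word = word.lower()
        forms = {word}
        for normalize in normalizers:
            forms.update(normalize(word))
        return forms

    # Overzealous by design: a single shared candidate form is enough.
    return not candidates(word1).isdisjoint(candidates(word2))


if __name__ == "__main__":
    print(same_base("walking", "walked", [crude_suffix_strip]))  # True
    print(same_base("sung", "sing", [crude_suffix_strip]))       # False
```

Adding more normalizers can only turn a False into a True, which matches the stated preference for false positives over false negatives. Pairs such as sung and sing still fail here because no purely suffix-based rule connects them.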

Does such a tool exist? Do I just need an extremely aggressive stemmer / lemmatizer combo, and if so, where would I find one?

1 Answer

The general task, as you've described it, is not possible from simple textual analysis of the input characters. English does not have consistent rules for handling words as they evolve. Yes, an excellent lemmatiser will solve the straightforward cases for you, those that can be discerned by applying transformations common within that POS (such as irregular verbs).

However, to eliminate false negatives, you must have complete coverage of the word's basis; complete coverage requires etymology, especially in cases where the root word isn't in the English language, or perhaps doesn't appear in the shortened word itself.

For instance, what software tool could tell you that dis and speculum have the same root (specere), but that species does not? How would you tell that gentle, gentile, genteel, and jaunty have the same root? You'll need the etymology to get 100% of the actual connections.
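To make the etymology point concrete, here is a minimal sketch of a lookup-based same_base. The `ROOTS` table is hypothetical, hand-built illustration data, not output from any real etymological resource, and the function is only as good as the table's coverage: any word missing from it falls back to being its own root.

```python
from typing import Dict, Optional, Set

# Hypothetical, hand-built etymology table (word -> set of root words).
# A real system would need a far larger, curated source.
ROOTS: Dict[str, Set[str]] = {
    "gentle": {"gentilis"},
    "gentile": {"gentilis"},
    "genteel": {"gentilis"},
    "jaunty": {"gentilis"},
    "speculum": {"specere"},
    "sing": {"sing"},
    "sung": {"sing"},
}


def same_base(word1: str, word2: str,
              roots: Optional[Dict[str, Set[str]]] = None) -> bool:
    """Return True if the two words share at least one root in the table."""
    table = ROOTS if roots is None else roots
    w1, w2 = word1.lower(), word2.lower()
    # Words absent from the table are treated as their own root, so gaps
    # in coverage show up as false negatives.
    r1 = table.get(w1, {w1})
    r2 = table.get(w2, {w2})
    return not r1.isdisjoint(r2)


if __name__ == "__main__":
    print(same_base("gentle", "jaunty"))    # True
    print(same_base("sung", "sing"))        # True
    print(same_base("gentle", "speculum"))  # False
```

The string comparison is trivial; all of the difficulty lives in building the table, which is why the character-level approaches cannot reach 100%.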