
Why do WordNet and the JWI stemmer give "ord" and "orde" as a result of stemming "order"?

I'm working on a project using WordNet and JWI 2.4.0. I have been running a lot of words through the bundled stemmer, and it seemed to work until I tried "order". The stemmer tells me that "order", "orde", and "ord" are the possible stems of "order". I'm not a native English speaker, but I have never seen the word "ord" in my life, and when I asked the WordNet dictionary for its definition there was, unsurprisingly, nothing. (In BabelNet online, I found that it is a town in Nebraska!)

So why does this strange stem appear? How can I filter out the stems that are not present in the WordNet dictionary? (When I reuse the stemmed words, "orde" makes the program crash.)

Thank you!

ANSWER: I had not properly understood what a stem is, so this question does not really make sense.

Here is some code to test:

package JWIExplorer;

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Arrays;
import java.util.Date;
import java.util.Iterator;
import java.util.List;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.morph.WordnetStemmer;

public class TestJWI
{

    public static void main(String[] args) throws IOException
    {
        List<String> WordList_Research = Arrays.asList("dog", "cat", "mouse");
        List<String> WordList_Research2 = Arrays.asList("order");

        String path = "./" + File.separator + "dict";
        URL url;

        url = new URL("file", null, path);

        System.out.println("BEGIN : " + new Date());

        for (Iterator<String> iterstr = WordList_Research2.iterator(); iterstr.hasNext();)
        {
            String str = iterstr.next();

            TestStem(url, str);
        }

        System.out.println("END : " + new Date());
    }

    public static void TestStem(URL url, String ResearchedWord) throws IOException
    {
        // construct the dictionary object and open it
        IDictionary dict = new Dictionary(url);
        dict.open();

        try
        {
            // Ask the stemmer for the candidate stems of the word.
            // The second argument is the part of speech:
            // null for all of them, POS.NOUN for nouns only.
            WordnetStemmer Stemmer = new WordnetStemmer(dict);
            List<String> StemmedWords = Stemmer.findStems(ResearchedWord, null);

            for (String str : StemmedWords)
            {
                System.out.println("Local stemmed iteration on : " + str);
            }
        }
        finally
        {
            // release the dictionary files even if stemming throws
            dict.close();
        }
    }

}

Stems do not necessarily need to be words by themselves. "Order" and "Ordinal" share the stem "Ord".
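To make this concrete, here is a minimal sketch of how a purely rule-based stemmer can end up proposing "ord" and "orde" for "order", and how those candidates could be filtered against a word list. It is not JWI's implementation: the suffix rules are modelled on the adjective rules documented for WordNet's morphy (strip "er"/"est", optionally restoring an "e"), and `LEXICON` is a hypothetical stand-in for the WordNet dictionary. With JWI, the membership test would presumably be a lookup such as `dict.getIndexWord(stem, pos)` returning non-null, but that wiring is an assumption, not something shown in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class StemFilterSketch {

    // {suffix, replacement} pairs, modelled on WordNet's adjective rules.
    // Assumption: a simplified subset, for illustration only.
    private static final String[][] RULES = {
        {"er", ""}, {"est", ""}, {"er", "e"}, {"est", "e"}
    };

    // Hypothetical stand-in for the real dictionary lookup.
    private static final Set<String> LEXICON =
        new HashSet<>(Arrays.asList("order", "ordinal", "nice", "large"));

    // Apply every matching suffix rule blindly; results need not be real words.
    static List<String> candidateStems(String word) {
        Set<String> stems = new LinkedHashSet<>();
        stems.add(word); // the word itself is always a candidate
        for (String[] rule : RULES) {
            if (word.endsWith(rule[0])) {
                String base = word.substring(0, word.length() - rule[0].length());
                stems.add(base + rule[1]);
            }
        }
        return new ArrayList<>(stems);
    }

    // Keep only the candidates that the lexicon actually contains.
    static List<String> knownStems(String word) {
        List<String> kept = new ArrayList<>();
        for (String stem : candidateStems(word)) {
            if (LEXICON.contains(stem)) {
                kept.add(stem);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(candidateStems("order")); // [order, ord, orde]
        System.out.println(knownStems("order"));     // [order]
    }
}
```

The point of the sketch is that the strange stems come from treating "order" as if it were a comparative adjective like "nicer"; filtering after the fact keeps only the candidates that are real dictionary entries.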

The fundamental problem here is that stems are related to spelling, but language evolution and spelling are only weakly related (especially in English). As programmers, we would much rather describe a stem as a regex, e.g. ^ord[ie]. This captures that it is not the stem of "ordained".
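As a small illustration of the regex view, the sketch below checks a few words against `^ord[ie]` using `java.util.regex`. The class and method names are made up for this example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class StemRegexSketch {

    // The stem expressed as a pattern: "ord" followed by "i" or "e".
    private static final Pattern ORD_STEM = Pattern.compile("^ord[ie]");

    // True if the word starts with the "ord" stem as described by the regex.
    static boolean hasOrdStem(String word) {
        return ORD_STEM.matcher(word.toLowerCase(Locale.ROOT)).find();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"order", "ordinal", "ordained"}) {
            System.out.println(word + " -> " + hasOrdStem(word));
        }
        // order -> true
        // ordinal -> true
        // ordained -> false
    }
}
```

"Ordained" is rejected because the letter after "ord" is "a", which is exactly the distinction a plain prefix check on "ord" would miss.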
