
Java Stanford NLP: Spell checking

I'm trying to check the spelling accuracy of text samples using Stanford NLP. It's just a metric of the text, not a filter or anything, so it's fine if it's off by a bit, as long as the error is uniform.

My first idea was to check whether each word is known to the parser's lexicon:

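import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

// Fragment of a larger class: `sentences` and the @Analyze annotation
// are defined elsewhere in the asker's project.
private static LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");

@Analyze(weight=25, name="Spelling")
public double spelling() {
    int result = 0;

    for (List<? extends HasWord> list : sentences) {
        for (HasWord w : list) {
            // Count every token the parser's lexicon has never seen.
            if (!lp.getLexicon().isKnown(w.word())) {
                System.out.format("misspelled: %s\n", w.word());
                result++;
            }
        }
    }

    // Cast to double: int / int would truncate the per-sentence average.
    return (double) result / sentences.size();
}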

However, this produces quite a lot of false positives:

misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus
misspelled: Camus
misspelled: foandf
misspelled: foandf
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: Camus
misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus

Any ideas on how to do this better?

Using the parser's lexicon's isKnown(String) method as a spellchecker isn't a viable use case of the parser. The method is correct: "false" means that this word was not seen (with the given capitalization) in the approximately 1 million words of text the parser is trained from. But 1 million words just isn't enough text to train a comprehensive spellchecker in a data-driven manner. People would typically use at least two orders of magnitude more text, and might well add some cleverness to handle capitalization. The parser includes some of this cleverness to handle words that were unseen in the training data, but this isn't reflected in what the isKnown(String) method returns.

It looks like the words you're getting flagged are divided between proper names, real words (which I assume don't exist in the lexicon) and true misspellings. A false negative on "Sincerity" also suggests that capitalization might be throwing it off, though you'd hope it'd be smart enough not to; worth checking anyway. Plurals shouldn't be an issue either, but a false negative on "gods"? Does it correctly identify "god"?
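One way to run the check suggested above is to probe a few surface variants of a suspicious word and see which ones the lookup accepts. The sketch below is only an illustration: `VariantProbe` and `probe` are hypothetical names, the lookup is passed in as a plain `Predicate<String>`, and the "singular" form is a naive guess that strips one trailing "s".

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical diagnostic helper; not part of Stanford NLP. */
final class VariantProbe {

    /** Returns each tried variant of the word mapped to whether the lookup knows it. */
    static Map<String, Boolean> probe(String word, Predicate<String> isKnown) {
        Map<String, Boolean> results = new LinkedHashMap<>();

        // 1. The word exactly as it appeared in the text.
        results.put(word, isKnown.test(word));

        // 2. The lower-cased form, to see whether capitalization is the problem.
        String lower = word.toLowerCase(Locale.ROOT);
        results.put(lower, isKnown.test(lower));

        // 3. A naive singular (assumption: just drop one trailing "s";
        //    irregular plurals are ignored).
        if (lower.length() > 1 && lower.endsWith("s")) {
            String singular = lower.substring(0, lower.length() - 1);
            results.put(singular, isKnown.test(singular));
        }
        return results;
    }
}
```

With the parser from the question, the lookup could be passed as `w -> lp.getLexicon().isKnown(w)`, for example `VariantProbe.probe("gods", w -> lp.getLexicon().isKnown(w))`. If "Sincerity" is unknown but "sincerity" is known, capitalization is the culprit.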

Since you're trying to check spelling, why check it indirectly? What is lp.getLexicon().isKnown(w.word()) doing internally? Doesn't it depend on the loaded corpus? Why not just load a dictionary, normalize the case, put the words into a big hash set, and do a "contains" check? Since you're in an NLP context, it should also be reasonably easy to strip out proper names, especially given that you're not looking for 100% accuracy.
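The dictionary approach described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: `DictionarySpellMetric` and its methods are hypothetical names, sentences are assumed to be already tokenized into lists of strings, and the proper-name filter is a crude stand-in (any capitalized token that is not sentence-initial) rather than real named-entity recognition.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: dictionary-based misspelling metric. */
final class DictionarySpellMetric {
    private final Set<String> dictionary = new HashSet<>();

    /** Stores every dictionary word lower-cased so lookups ignore capitalization. */
    DictionarySpellMetric(Iterable<String> words) {
        for (String w : words) {
            String trimmed = w.trim();
            if (!trimmed.isEmpty()) {
                dictionary.add(trimmed.toLowerCase(Locale.ROOT));
            }
        }
    }

    /** Case-insensitive "contains" check against the hash set. */
    boolean isKnown(String token) {
        return dictionary.contains(token.toLowerCase(Locale.ROOT));
    }

    /** Crude heuristic (assumption): capitalized and not first in the sentence. */
    static boolean looksLikeProperName(List<String> sentence, int index) {
        String token = sentence.get(index);
        return index > 0 && !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    /** Average number of unknown alphabetic tokens per sentence. */
    double misspellingsPerSentence(List<List<String>> sentences) {
        if (sentences.isEmpty()) {
            return 0.0; // avoid dividing by zero on empty input
        }
        int unknown = 0;
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                String token = sentence.get(i);
                // Skip empty tokens, punctuation, and anything containing non-letters.
                if (token.isEmpty() || !token.chars().allMatch(Character::isLetter)) continue;
                if (looksLikeProperName(sentence, i)) continue;
                if (!isKnown(token)) unknown++;
            }
        }
        return (double) unknown / sentences.size();
    }
}
```

The word list can come from any source with one word per line, for instance the result of `Files.readAllLines` on a local dictionary file (the file itself is an assumption here). Because the metric only needs to be uniform rather than exact, the heuristic's mistakes (such as skipping a misspelled capitalized word) should matter less than they would in a filter.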


 