Detecting English words and NLTK's words corpus

Question

I'm just trying to check whether a word is English or not. This:

import nltk

english_words = set(nltk.corpus.words.words())
print("revised" in english_words)

results in False. Am I doing something wrong? Is this to be expected? Are there better ways of doing this? Thanks.

Answer 1

It seems that "revised" is indeed not in the word list:

import nltk

english_words = set(nltk.corpus.words.words())

for w in english_words:
    if w.startswith("revise"):
        print(w)

prints the following list:

reviser
revise
revisee
revisership
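The list above holds the base form "revise" but not the inflected form "revised". A minimal sketch of one workaround, assuming a crude suffix-stripping heuristic: the helpers candidate_bases and is_known are hypothetical names, not part of NLTK, and sample_words is a stand-in for the real corpus set. The heuristic will accept some non-words and miss irregular forms, so treat it as an illustration only.

```python
def candidate_bases(word):
    """Yield the lowercased word, then crude guesses at its base form."""
    w = word.lower()
    yield w
    # (suffix to strip, text to append) pairs; a rough heuristic only.
    rules = (("ies", "y"), ("ing", ""), ("ing", "e"),
             ("ed", ""), ("d", ""), ("es", ""), ("s", ""))
    for suffix, replacement in rules:
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            yield w[: -len(suffix)] + replacement


def is_known(word, word_list):
    """True if the word or one of its guessed base forms is in word_list."""
    return any(base in word_list for base in candidate_bases(word))


# Stand-in for set(nltk.corpus.words.words()), using the entries printed above.
sample_words = {"reviser", "revise", "revisee", "revisership"}

print(is_known("revised", sample_words))   # True: "revised" -> "revise"
print(is_known("revising", sample_words))  # True: "revising" -> "revise"
print(is_known("xyzzyed", sample_words))   # False
```

With the real corpus, the same call would be is_known("revised", english_words).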

Based on this source, section 4.1, this is where the word list originates:

The Words Corpus is the /usr/share/dict/words file from Unix

So you'll have to decide for your use case whether the word list provided by NLTK is enough or whether you want to switch to a more complete (and bigger) one.
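If you do switch to a bigger list, one option is to load it from a plain text file. A minimal sketch, assuming the file has one word per line: load_word_list and is_english are hypothetical helper names, and which words the file contains depends entirely on the list you choose.

```python
from pathlib import Path


def load_word_list(path):
    """Read a one-word-per-line file into a lowercase set, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def is_english(word, word_list):
    """Case-insensitive membership test against a loaded word list."""
    return word.lower() in word_list
```

On a system that ships it, load_word_list("/usr/share/dict/words") would load the local dictionary; whether "revised" is in it depends on the installed file.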

Answer 2

Try this:

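from nltk.corpus import wordnet

word_to_test = "revised"  # example input

if not wordnet.synsets(word_to_test):
    # No synsets found: treat as not an English word
    print("Not an English word")
else:
    # At least one synset found: treat as an English word
    print("English word")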
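One caveat: WordNet only covers nouns, verbs, adjectives and adverbs, so function words such as "the" or "of" generally have no synsets and a WordNet-only check can reject them. A minimal sketch that accepts a word if either source knows it: make_checker and nltk_checker are hypothetical names, and nltk_checker assumes the 'words' and 'wordnet' data packages have already been downloaded.

```python
def make_checker(word_list, synset_lookup):
    """Build is_english(word) from a word collection and a WordNet-style lookup.

    synset_lookup is any callable that returns a non-empty sequence for
    known words, for example nltk.corpus.wordnet.synsets.
    """
    known = {w.lower() for w in word_list}

    def is_english(word):
        w = word.lower()
        # Accept the word if the plain list has it or the lookup finds synsets.
        return w in known or bool(synset_lookup(w))

    return is_english


def nltk_checker():
    """Wire the checker to NLTK's words corpus and WordNet."""
    # Imported lazily so make_checker stays usable without NLTK installed.
    from nltk.corpus import wordnet, words
    return make_checker(words.words(), wordnet.synsets)
```

Usage would look like is_english = nltk_checker() followed by is_english("revised"). Passing the lookup in as a parameter keeps the combining logic separate from NLTK, so it can be tried with any word source.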

Answer 1
2 2019-02-07 13:52:14

Answer 2
1 2019-02-07 13:47:44
