Detect English words and NLTK's words corpus

I'm just trying to check whether a word is English or not. This:

import nltk

english_words = set(nltk.corpus.words.words())
print("revised" in english_words)

results in False. Am I doing something wrong? Is this to be expected? Are there better ways of doing this? Thanks.

It seems that "revised" is indeed not in the word list:

import nltk

english_words = set(nltk.corpus.words.words())

for w in english_words:
    if w.startswith("revise"):
        print(w)

prints the following list:

reviser
revise
revisee
revisership
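
The output above suggests the list holds base forms such as "revise" but not inflected forms such as "revised". One possible workaround, sketched here under that assumption, is to retry the lookup after stripping a few common suffixes. The helper names `candidate_base_forms` and `looks_english` and the suffix rules are hypothetical and purely illustrative; this is not a real stemmer and not part of NLTK.

```python
def candidate_base_forms(word):
    """Yield the lower-cased word, then naive guesses at its base form."""
    word = word.lower()
    yield word
    # (suffix, replacement) pairs: illustrative assumptions, not a full stemmer.
    rules = [
        ("ies", "y"),
        ("es", ""),
        ("s", ""),
        ("ed", "e"),
        ("ed", ""),
        ("ing", "e"),
        ("ing", ""),
    ]
    for suffix, replacement in rules:
        # Require a stem of at least three letters to avoid mangling short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            yield word[: -len(suffix)] + replacement


def looks_english(word, known_words):
    """Return True if the word or any guessed base form is in known_words."""
    return any(form in known_words for form in candidate_base_forms(word))
```

With `english_words` built as above, `looks_english("revised", english_words)` should succeed because "revise" is in the list. Note that the input is lower-cased, so capitalized entries in the word list will not match, and the crude rules can accept non-words whose stripped form happens to be listed.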

According to this source (section 4.1), this is where the word list comes from:

The Words Corpus is the /usr/share/dict/words file from Unix

So you'll have to decide for your use case whether the word list provided by NLTK is enough, or whether you want to switch to a more complete (and bigger) one.
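
If you do switch to a bigger list, a minimal sketch of loading one could look like the following. It assumes a plain-text file with one word per line encoded as UTF-8 (the same layout as /usr/share/dict/words); the function names `load_word_list` and `is_listed` are hypothetical, and which word list to use is left to you.

```python
from pathlib import Path


def load_word_list(path):
    """Read one word per line into a lower-cased set, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def is_listed(word, words):
    """Case-insensitive membership test against a set from load_word_list."""
    return word.lower() in words
```

Usage would be along the lines of `words = load_word_list("bigger_words.txt")` followed by `is_listed("revised", words)`, where `bigger_words.txt` is a placeholder file name. Lower-casing everything trades away the distinction between proper nouns and common words for simpler lookups.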

Try this:

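from nltk.corpus import wordnet  # needs the WordNet data: nltk.download("wordnet")

word_to_test = "revised"  # example input; substitute your own word

# synsets() returns an empty list when WordNet has no entry for the word
if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")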
