Meaningless spaCy nouns

Question

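I am using spaCy to extract nouns from sentences. The sentences are grammatically poor and may also contain spelling mistakes.

Here is the code I am using:

Code: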

import spacy
import re

nlp = spacy.load("en_core_web_sm")

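sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
# Replace runs of non-word characters with a single space (raw string avoids an invalid-escape warning)
cleanString = re.sub(r'\W+', ' ', string)
cleanString = cleanString.replace("_", " ")

doc = nlp(cleanString)

# Print every token that spaCy tags as a noun
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)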

Output:

sfx

Similarly, for the sentence "fast foward2", spaCy returns the noun

foward2

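This shows that the extracted nouns include meaningless words such as sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.

I only want to keep phrases whose nouns are meaningful words, such as broom, ticker, pool, highway, etc.

I have tried using WordNet to keep only the nouns common to WordNet and spaCy, but it is somewhat strict and filters out some sensible nouns. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.

So I am looking for a solution that lets me keep only the sensible nouns from the list of nouns that spaCy returns.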

Answer 1

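It seems you can use the pyenchant library:

Enchant is used to check the spelling of words and suggest corrections for words that are misspelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.

More information is available on the Enchant website: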

https://abiword.github.io/enchant/
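
Since Enchant can work with more than one dictionary, one way to make the check less strict than a single word list is to accept a token when any of several checkers recognizes it. The sketch below is only an illustration under assumptions: known_by_any is a hypothetical helper, and the two small sets stand in for real dictionaries so that the example runs without pyenchant installed. With pyenchant, the check methods of enchant.Dict("en_US") and enchant.Dict("en_GB") could be passed instead, provided both dictionaries are installed on the system.

```python
from typing import Callable, Iterable


def known_by_any(word: str, checkers: Iterable[Callable[[str], bool]]) -> bool:
    """Return True if at least one checker recognizes the word."""
    return any(check(word) for check in checkers)


# Hypothetical stand-ins for real dictionaries (assumption, not Enchant data).
us_words = {"trolley", "metal", "suitcase", "zip"}
gb_words = {"motorbike", "trolley", "tyre"}

# Each checker is a callable that takes a word and returns a bool.
checkers = [us_words.__contains__, gb_words.__contains__]

candidates = ["motorbike", "sfx", "zip", "foward2"]
kept = [w for w in candidates if known_by_any(w, checkers)]
print(kept)
```

Any callable that takes a string and returns a bool fits here, so a WordNet lookup could be added to the list as one more checker instead of being the only filter.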

Sample Python code:

import spacy, re
import enchant                        #pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")

sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # Merge \W and _ into one character class

doc = nlp(cleanString)
for token in doc:
    # Keep only nouns that the Enchant dictionary recognizes
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => [example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip]
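
The same idea can be wrapped in a reusable function that applies a cheap shape filter before the dictionary lookup, since tokens such as foward2, 64x and r can be rejected without consulting a dictionary at all. This is a sketch and not part of the original answer: filter_nouns, min_length and the demo vocabulary are assumptions made for illustration. In practice is_known_word would be d.check from the code above.

```python
from typing import Callable, List


def filter_nouns(nouns: List[str],
                 is_known_word: Callable[[str], bool],
                 min_length: int = 3) -> List[str]:
    """Keep nouns that are purely alphabetic, long enough, and known to the dictionary."""
    kept = []
    for noun in nouns:
        if not noun.isalpha():       # rejects "foward2" and "64x"
            continue
        if len(noun) < min_length:   # rejects "r" and "ms"
            continue
        if is_known_word(noun):      # dictionary lookup, e.g. pyenchant's d.check
            kept.append(noun)
    return kept


# Hypothetical vocabulary so the example runs without pyenchant (assumption).
demo_vocab = {"broom", "ticker", "pool", "highway", "bit"}

nouns = ["sfx", "foward2", "ms", "64x", "bit", "pwm", "r", "broom", "highway"]
result = filter_nouns(nouns, demo_vocab.__contains__)
print(result)
```

Note that a dictionary check cannot reject real words such as bit, and a minimum length of 3 also drops valid short nouns such as ox, so both thresholds may need tuning for the data at hand.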

Solution 1
5, accepted, 2021-03-22 21:36:21
