
Removing named entities from a document using spacy

I have tried to remove words from a document that are considered to be named entities by spacy, so basically removing "Sweden" and "Nokia" from the example string. I could not find a way to work around the problem that entities are stored as spans, so comparing them with single tokens from a spacy doc raises an error.

In a later step, this process is supposed to become a function applied to several text documents stored in a pandas data frame.
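A minimal sketch of that later step, assuming a hypothetical DataFrame with a `text` column (the stand-in word filter below just mimics entity removal and is not spaCy itself):

```python
import pandas as pd

# Stand-in for the words spaCy would detect as entities (an assumption
# for illustration; in the real pipeline these come from doc.ents)
ENTITY_WORDS = {"Sweden", "Nokia"}

def remove_named_entities(text):
    # Placeholder filter; swap in the spaCy-based logic discussed below
    return " ".join(w for w in text.split() if w not in ENTITY_WORDS)

df = pd.DataFrame({"text": ["Sweden exports Nokia phones", "a plain sentence"]})
df["clean"] = df["text"].apply(remove_named_entities)
print(df["clean"].tolist())
```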

I would appreciate any kind of help, and also advice on how to better post questions, as this is my first one here.


import spacy

nlp = spacy.load('en')

text_data = u'This is a text document that speaks about entities like Sweden and Nokia'

document = nlp(text_data)

text_no_namedentities = []

for word in document:
    if word not in document.ents:  # compares a Token against Span objects
        text_no_namedentities.append(word)

return " ".join(text_no_namedentities)

It produces the following error:

TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)

This will not handle entities covering multiple tokens.

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

# Keep only tokens whose text does not match an entity string
ents = [e.text for e in document.ents]
text_no_namedentities = []
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output

'New York is in'

Here USA is correctly removed, but New York could not be eliminated, because the multi-token entity string "New York" never matches any single token.

Solution

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
# token.ent_type_ is an empty string for tokens outside any entity
print(" ".join([token.text for token in document if not token.ent_type_]))

Output

'is in'

This will get you the result you're asking for. Reviewing the Named Entity Recognition documentation should help you going forward.

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)

# Keep only tokens whose text does not match an entity string
ents = [e.text for e in document.ents]
text_no_namedentities = []
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output:

This is a text document that speaks about entities like and

You could use the entity attributes start_char and end_char to replace each entity with an empty string.

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

ents = [(e.start_char, e.end_char) for e in document.ents]

# Remove spans from right to left, so that earlier character
# offsets remain valid after each deletion
for start_char, end_char in reversed(ents):
    text_data = text_data[:start_char] + text_data[end_char:]
print(text_data)

I had issues with the above solutions: kochar96's and APhillips's answers modify the text because of spaCy's tokenization, so can't becomes ca n't after the join.

I couldn't quite follow Batmobil's solution, but I followed the general idea of using the start and end indices.

An explanation of the hack-y numpy solution is in the printout. (I don't have time to do something more reasonable; feel free to edit and improve.)

import numpy as np
import spacy

nlp = spacy.load('en_core_web_sm')

text_data = "This can't be a text document that speaks about entities like Sweden and Nokia"
my_ents = [(e.start_char, e.end_char) for e in nlp(text_data).ents]
my_str = text_data

print(f'{my_ents=}')
# Flatten the entity boundaries into [0, s1, e1, s2, e2, ..., -1] and pair
# them up so each row is a (start, end) slice of text to keep. Note the
# trailing -1 drops the final character if text continues past the last entity.
idx_keep = [0] + np.array(my_ents).ravel().tolist() + [-1]
idx_keep = np.array(idx_keep).reshape(-1, 2)
print(idx_keep)

keep_text = ''
for start_char, end_char in idx_keep:
    keep_text += my_str[start_char:end_char]
print(keep_text)
Output:

my_ents=[(62, 68), (73, 78)]
[[ 0 62]
 [68 73]
 [78 -1]]
This can't be a text document that speaks about entities like  and
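The same start/end bookkeeping can be done without numpy by slicing between consecutive entity boundaries; a sketch with a hypothetical helper name:

```python
def strip_spans(text, spans):
    """Remove the given (start, end) character spans from text, keeping the rest."""
    kept, prev_end = [], 0
    for start, end in sorted(spans):
        kept.append(text[prev_end:start])  # text before this span
        prev_end = end
    kept.append(text[prev_end:])  # tail after the last span
    return "".join(kept)

text = "This can't be a text document that speaks about entities like Sweden and Nokia"
# Character spans as produced by (e.start_char, e.end_char) over doc.ents
print(strip_spans(text, [(62, 68), (73, 78)]))
```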
