
Removing named entities from a document using spacy

I have tried to remove words from a document that are considered to be named entities by spacy, so basically removing "Sweden" and "Nokia" from the example string. I could not find a way to work around the problem that entities are stored as spans, so comparing them with single tokens from a spacy doc raises an error.

In a later step, this process is supposed to become a function applied to several text documents stored in a pandas data frame.
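A minimal sketch of that later step, assuming a hypothetical DataFrame with a `text` column (the stand-in word filter below just mimics entity removal and is not spaCy itself):

```python
import pandas as pd

# Stand-in for the words spaCy would detect as entities (an assumption
# for illustration; in the real pipeline these come from doc.ents)
ENTITY_WORDS = {"Sweden", "Nokia"}

def remove_named_entities(text):
    # Placeholder filter; swap in the spaCy-based logic discussed below
    return " ".join(w for w in text.split() if w not in ENTITY_WORDS)

df = pd.DataFrame({"text": ["Sweden exports Nokia phones", "a plain sentence"]})
df["clean"] = df["text"].apply(remove_named_entities)
print(df["clean"].tolist())
```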

I would appreciate any kind of help, and also advice on how to better post questions, as this is my first one here.


import spacy

nlp = spacy.load('en')

text_data = u'This is a text document that speaks about entities like Sweden and Nokia'

document = nlp(text_data)

text_no_namedentities = []

for word in document:
    if word not in document.ents:  # compares a Token against Span objects
        text_no_namedentities.append(word)

return " ".join(text_no_namedentities)

It produces the following error:

TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)

This will not handle entities covering multiple tokens.

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

# Keep only tokens whose text does not match an entity string
ents = [e.text for e in document.ents]
text_no_namedentities = []
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output

'New York is in'

Here USA is correctly removed, but New York could not be eliminated, because the multi-token entity string "New York" never matches any single token.

Solution

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
# token.ent_type_ is an empty string for tokens outside any entity
print(" ".join([token.text for token in document if not token.ent_type_]))

Output

'is in'

This will get you the result you're asking for. Reviewing the Named Entity Recognition documentation should help you going forward.

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)

# Keep only tokens whose text does not match an entity string
ents = [e.text for e in document.ents]
text_no_namedentities = []
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output:

This is a text document that speaks about entities like and

You could use the entity attributes start_char and end_char to replace each entity with an empty string.

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

ents = [(e.start_char, e.end_char) for e in document.ents]

# Remove spans from right to left, so that earlier character
# offsets remain valid after each deletion
for start_char, end_char in reversed(ents):
    text_data = text_data[:start_char] + text_data[end_char:]
print(text_data)

I had issues with the above solutions: kochar96's and APhillips's answers modify the text because of spaCy's tokenization, so can't becomes ca n't after the join.

I couldn't quite follow Batmobil's solution, but I followed the general idea of using the start and end indices.

An explanation of the hack-y numpy solution is in the printout. (I don't have time to do something more reasonable; feel free to edit and improve.)

import numpy as np
import spacy

nlp = spacy.load('en_core_web_sm')

text_data = "This can't be a text document that speaks about entities like Sweden and Nokia"
my_ents = [(e.start_char, e.end_char) for e in nlp(text_data).ents]
my_str = text_data

print(f'{my_ents=}')
# Flatten the entity boundaries into [0, s1, e1, s2, e2, ..., -1] and pair
# them up so each row is a (start, end) slice of text to keep. Note the
# trailing -1 drops the final character if text continues past the last entity.
idx_keep = [0] + np.array(my_ents).ravel().tolist() + [-1]
idx_keep = np.array(idx_keep).reshape(-1, 2)
print(idx_keep)

keep_text = ''
for start_char, end_char in idx_keep:
    keep_text += my_str[start_char:end_char]
print(keep_text)
Output:

my_ents=[(62, 68), (73, 78)]
[[ 0 62]
 [68 73]
 [78 -1]]
This can't be a text document that speaks about entities like  and
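The same start/end bookkeeping can be done without numpy by slicing between consecutive entity boundaries; a sketch with a hypothetical helper name:

```python
def strip_spans(text, spans):
    """Remove the given (start, end) character spans from text, keeping the rest."""
    kept, prev_end = [], 0
    for start, end in sorted(spans):
        kept.append(text[prev_end:start])  # text before this span
        prev_end = end
    kept.append(text[prev_end:])  # tail after the last span
    return "".join(kept)

text = "This can't be a text document that speaks about entities like Sweden and Nokia"
# Character spans as produced by (e.start_char, e.end_char) over doc.ents
print(strip_spans(text, [(62, 68), (73, 78)]))
```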
