How to remove an entity from a sentence with spaCy?

I want to randomly remove a NORP, GPE, MONEY, ORDINAL, or PERCENT entity from a sentence. For example,

Donald John Trump[person] (born June 14, 1946)[date] is the 45th[ordinal] and current president of the United States[GPE]. Before entering politics, he was a businessman and television personality.

Now, how can I remove a certain entity from this sentence? In this example, the function chose to remove 45th, an ordinal entity.

>>> sentence = 'Donald John Trump (born June 14, 1946) is the 45th and current president of the United States. Before entering politics, he was a businessman and television personality.'
>>> remove(sentence)
45th

Try spaCy NER together with np.random.choice:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

sentence = 'Donald John Trump (born June 14, 1946) is the 45th and current president of the United States. Before entering politics, he was a businessman and television personality.'
doc = nlp(sentence)

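# Keep only the entity types that are candidates for removal
ents = [e.text for e in doc.ents if e.label_ in ("NORP", "GPE", "MONEY", "ORDINAL", "PERCENT")]

def remove(x):
    # Pick one entity text at random
    return str(np.random.choice(x))

# Example output (the choice is random, so it varies between runs)
remove(ents)
'45th'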
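
As a side note, the random pick above raises an error when the entity list is empty, and its result changes on every run. A minimal sketch of a more defensive picker, using only the standard library, might look like the following. The name `pick_entity` and the `seed` parameter are hypothetical and not part of the original answer.

```python
import random

def pick_entity(ents, seed=None):
    """Return one entity text chosen at random, or None if the list is empty.

    `seed` is an assumed convenience parameter: passing the same seed
    gives the same pick, which helps when results must be reproducible.
    """
    if not ents:
        return None
    return random.Random(seed).choice(ents)

# Example values, shaped like the `ents` list built from doc.ents
ents = ["45th", "the United States"]
print(pick_entity(ents, seed=0) == pick_entity(ents, seed=0))  # True
print(pick_entity([]))  # None
```

Returning None for an empty list lets the caller decide what to do when a sentence has no matching entities.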

If you want to remove a random entity from the sentence text:

def remove_from_sentence(sentence):
    doc = nlp(sentence)
    # Merge each entity into a single token so it can be dropped as a unit
    with doc.retokenize() as retokenizer:
        for e in doc.ents:
            retokenizer.merge(doc[e.start:e.end])
    # Keep each token's trailing whitespace so the text can be rebuilt
    tok_pairs = [(tok.text, tok.whitespace_) for tok in doc]
    ents = [e.text for e in doc.ents if e.label_ in ("NORP", "GPE", "MONEY", "ORDINAL", "PERCENT")]
    ent_to_remove = remove(ents)
    print(ent_to_remove)
    # Note: this drops every token whose text equals the chosen entity
    tok_pairs_out = [pair for pair in tok_pairs if pair[0] != ent_to_remove]
    return "".join(text + ws for text, ws in tok_pairs_out)

remove_from_sentence(sentence)

the United States
'Donald John Trump (born June 14, 1946) is the 45th and current president of . Before entering politics, he was a businessman and television personality.'
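
The function above matches tokens by text, so if the chosen entity appears more than once, every occurrence is dropped. A possible alternative, sketched here in plain Python under the assumption that the entities have already been collected as `(start_char, end_char, label)` tuples (for example from `e.start_char`, `e.end_char`, and `e.label_` of each span in `doc.ents`), removes exactly one occurrence by character offset. The name `remove_random_span` and its parameters are hypothetical.

```python
import random

TARGET_LABELS = {"NORP", "GPE", "MONEY", "ORDINAL", "PERCENT"}

def remove_random_span(text, spans, labels=TARGET_LABELS, rng=random):
    """Remove one randomly chosen span whose label is in `labels`.

    `spans` is a list of (start_char, end_char, label) tuples.
    Returns (removed_text, new_text), or (None, text) if nothing matches.
    """
    candidates = [s for s in spans if s[2] in labels]
    if not candidates:
        return None, text
    start, end, _ = rng.choice(candidates)
    removed = text[start:end]
    # Also drop whitespace right after the span, similar to dropping
    # a token together with its trailing whitespace
    while end < len(text) and text[end].isspace():
        end += 1
    return removed, text[:start] + text[end:]

# Offsets are computed by hand here; with spaCy they would come from doc.ents
text = "He is the 45th president."
start = text.index("45th")
spans = [(start, start + len("45th"), "ORDINAL")]
removed, new_text = remove_random_span(text, spans)
print(removed)   # 45th
print(new_text)  # He is the president.
```

Working on character offsets leaves the rest of the sentence untouched and avoids retokenizing the document.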

Please ask if anything is unclear.

