
How to remove all the 'ORG' entities collected from spaCy

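I am working on an NLP project using spaCy. I have identified different entities with spaCy's NER, and now I want to remove the ORG entities (those recognized as organizations) from the original input string.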

doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times  . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi  and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg  in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"

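import spacy
NER = spacy.load("en_core_web_sm")  # assumption: the pipeline actually used is not shown in the original
text = NER(doc)
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']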

Output of org_stopwords:

['The Financial Times  ', 'Abu Dhabi  and Lagos', 'Bloomberg  ']

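This is my code so far. I have identified and listed everything spaCy recognizes as ORG, but I don't know how to remove these from the string. One problem is that the usual approach of splitting the string into words and dropping the stopwords does not work here, because the entries in org_stopwords are n-grams. Please help with a coding example showing how to tackle this.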

Use a regular expression instead of replace:

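   import re

   # ORG entity strings to strip from the text
   org_stopwords = ['The Financial Times',
                    'Abu Dhabi  ',
                    'U.S. News Editor',
                    'Independent',
                    'ANDREW']

   # Escape each entry so that characters such as '.' match literally
   regex = re.compile('|'.join(re.escape(word) for word in org_stopwords))
   new_doc = regex.sub('', doc)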
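
As an alternative to matching the entity strings again, each entity in a spaCy Doc carries its character offsets (ent.start_char and ent.end_char), so the ORG spans can be cut out of the original string directly, which sidesteps the n-gram problem. The following is a minimal sketch under stated assumptions, not a definitive implementation: remove_spans is a hypothetical helper name, and the sample sentence and its offsets are built by hand so the snippet runs without spaCy installed.

```python
import re


def remove_spans(text, spans):
    """Return `text` with every (start, end) character span cut out.

    `remove_spans` is a hypothetical helper, not part of spaCy.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        # If spans overlap, only keep text that has not been cut already.
        pieces.append(text[cursor:max(start, cursor)])
        cursor = max(cursor, end)
    pieces.append(text[cursor:])
    # Collapse the runs of spaces left behind where a phrase was removed.
    return re.sub(r" {2,}", " ", "".join(pieces))


sample = "He joined after a long stint at Bloomberg in Tokyo."

# With spaCy, the spans would come from the parsed document instead, e.g.
#   spans = [(ent.start_char, ent.end_char)
#            for ent in text.ents if ent.label_ == "ORG"]
# Here one span is located by hand (assumption, for illustration only).
start = sample.index("Bloomberg")
spans = [(start, start + len("Bloomberg"))]

cleaned = remove_spans(sample, spans)
print(cleaned)  # He joined after a long stint at in Tokyo.
```

Working from offsets removes exactly the occurrences the NER tagged, whereas a regex built from the entity text also removes any other occurrence of the same string. Note that the final whitespace cleanup collapses every run of spaces in the text, not only those next to a removed entity; drop that line if the original spacing must be preserved.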

