How to remove all the 'ORG' entities collected from spaCy

I'm working on an NLP project using spaCy. I have identified the entities in a text with spaCy's NER, and I want to remove the ORG entities (those identified as organisations) from the original input string.

doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times  . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi  and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg  in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"

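# NER is assumed to be an already-loaded spaCy pipeline; the post does not show how it was created.
text = NER(doc)
# Collect the text of every entity labelled as an organisation.
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']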

Output of org_stopwords:

['The Financial Times  ', 'Abu Dhabi  and Lagos', 'Bloomberg  ']

This is my code so far. I've built a list of everything spaCy identified as ORG, but I don't know how to remove those entries from the string. Splitting the string into words and filtering out the stopwords doesn't work, because the entries in org_stopwords are n-grams (multi-word phrases). Please help with a coded example of how to tackle this.

Use a regex instead of replace:

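   import re

   org_stopwords = ['The Financial Times',
                    'Abu Dhabi  ',
                    'U.S. News Editor',
                    'Independent',
                    'ANDREW']

   # Escape each phrase so characters such as "." match literally, and try
   # longer phrases first so a shorter one cannot pre-empt a longer match.
   phrases = sorted(org_stopwords, key=len, reverse=True)
   regex = re.compile('|'.join(re.escape(p) for p in phrases))
   new_doc = regex.sub('', doc)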
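An alternative that avoids string matching altogether is to cut the entities out by character position. spaCy entity spans expose `start_char` and `end_char` offsets into the original text, so the ORG spans can be removed directly. The sketch below is not from the original answer: `remove_spans` is a hypothetical helper, and the sample sentence and its offsets are built by hand so that the snippet runs without spaCy installed.

```python
import re


def remove_spans(text, spans):
    """Return `text` with every (start, end) character span cut out.

    `remove_spans` is a hypothetical helper written for this example;
    it is not part of spaCy.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        # Clip spans that overlap text we have already removed.
        start = max(start, cursor)
        if end <= start:
            continue
        pieces.append(text[cursor:start])
        cursor = end
    pieces.append(text[cursor:])
    # Collapse any runs of spaces left behind by the removals.
    return re.sub(r" {2,}", " ", "".join(pieces)).strip()


# With spaCy, the spans would come from the entities themselves, e.g.:
#   spans = [(ent.start_char, ent.end_char)
#            for ent in text.ents if ent.label_ == "ORG"]
# Here the offsets are computed by hand for a short sample sentence.
sample = "He worked at Bloomberg  in Tokyo."
start = sample.index("Bloomberg")
spans = [(start, start + len("Bloomberg  "))]

cleaned = remove_spans(sample, spans)
print(cleaned)
```

Because this works on offsets, it removes only the occurrences spaCy actually tagged as ORG, whereas a regex built from the entity texts removes every occurrence of those strings anywhere in the document.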
   
