
How to remove all the 'ORG' entities identified by spaCy

I'm working on an NLP project using spaCy. I have identified the entities in my text with spaCy's NER, and I want to remove the ORG entities (those identified as organisations) from the original input string.

doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times  . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi  and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg  in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"

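import spacy
NER = spacy.load("en_core_web_sm")  # assumption: the question does not say which pipeline was loaded
text = NER(doc)
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']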

Output of org_stopwords:

['The Financial Times  ', 'Abu Dhabi  and Lagos', 'Bloomberg  ']

This is my code so far. I've built a list of everything spaCy identified as ORG, but I don't know how to remove those entries from the string. I can't simply split the string into words and filter out the org_stopwords, because the org_stopwords are n-grams (multi-word phrases). Please help with a coded example of how to tackle this.

Use a regex instead of str.replace():

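   import re

   # Example entity strings; in practice use the org_stopwords list built from text.ents
   org_stopwords = ['The Financial Times',
                    'Abu Dhabi  ',
                    'U.S. News Editor',
                    'Independent',
                    'ANDREW']

   # Escape each entry so that characters such as '.' are matched literally,
   # then join the entries into a single alternation pattern
   regex = re.compile('|'.join(re.escape(s) for s in org_stopwords))
   new_doc = regex.sub('', doc)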
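
An alternative that avoids string matching altogether is to cut the entities out by position. spaCy exposes each entity's character offsets as ent.start_char and ent.end_char, so only the exact occurrences the NER tagged are removed, not every other occurrence of the same text. The sketch below is a minimal illustration, not part of the original answer: remove_spans is a hypothetical helper name, and the spaCy lines are shown as comments so the function itself needs nothing beyond plain Python.

```python
def remove_spans(text, spans):
    """Return text with every (start, end) character span cut out.

    remove_spans is a hypothetical helper written for this example.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        # Keep the text between the previous removed span and this one
        if start > cursor:
            pieces.append(text[cursor:start])
        # max() keeps the cursor moving forward if spans overlap
        cursor = max(cursor, end)
    pieces.append(text[cursor:])
    return "".join(pieces)


# With spaCy, the spans would come from the entities themselves
# (assuming text = NER(doc) as in the question):
# spans = [(ent.start_char, ent.end_char)
#          for ent in text.ents if ent.label_ == "ORG"]
# new_doc = remove_spans(doc, spans)

# Small demo with a hand-built span
sample = "He worked at Bloomberg in Tokyo."
start = sample.index("Bloomberg")
print(remove_spans(sample, [(start, start + len("Bloomberg"))]))
```

Removing a span leaves the surrounding whitespace untouched, so the demo prints the sentence with two spaces where the entity used to be. Whether to collapse those is a separate choice that depends on how the cleaned text is used.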
   
