
How to remove all the 'ORG' entities collected from spaCy

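I am working on an NLP project using spaCy. I have used spaCy's NER to identify the different entities, and now I want to remove the ORG entities (those recognized as organizations) from the original input string.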

doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times  . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi  and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg  in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"

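import spacy
NER = spacy.load("en_core_web_sm")  # assumption: the question does not say which model NER is
text = NER(doc)
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']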

Output of org_stopwords:

['The Financial Times  ', 'Abu Dhabi  and Lagos', 'Bloomberg  ']

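This is my current code. I have identified and listed everything spaCy labels as ORG, but I don't know how to remove those items from the string. The problem is that the usual approach of splitting the string into words and dropping the stopwords does not work here, because the entries in org_stopwords are n-grams (multi-word phrases). Could someone show a code example of how to solve this?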

Use a regular expression instead of str.replace:

   import re
   org_stopwords = ['The Financial Times',
                    'Abu Dhabi  ',
                    'U.S. News Editor',
                    'Independent',
                    'ANDREW']

   # Escape each phrase so characters such as '.' are matched literally
   regex = re.compile('|'.join(map(re.escape, org_stopwords)))
   new_doc = regex.sub('', doc)
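
One thing to keep in mind with the regex approach is that it deletes every occurrence of each phrase, including places where the NER did not tag it as ORG. A possible alternative is to cut the entities out by their character offsets. The sketch below is not part of the original answer; `remove_spans` and `squeeze_spaces` are hypothetical helper names, and the sample sentence with its hand-built spans only stands in for real spaCy output so that the snippet runs without a model installed.

```python
import re


def remove_spans(text, spans):
    """Return `text` with every (start, end) character span cut out."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        # max() guards against overlapping or out-of-order spans
        pieces.append(text[cursor:max(cursor, start)])
        cursor = max(cursor, end)
    pieces.append(text[cursor:])
    return "".join(pieces)


def squeeze_spaces(text):
    """Collapse runs of spaces left behind by the removals."""
    return re.sub(r" {2,}", " ", text)


# Stand-in data (assumption): offsets are looked up by hand here,
# whereas with spaCy they would come from the entities themselves.
sample = "He joined from The Financial Times and Bloomberg in Tokyo."
orgs = ["The Financial Times", "Bloomberg"]
spans = [(sample.index(o), sample.index(o) + len(o)) for o in orgs]

cleaned = squeeze_spaces(remove_spans(sample, spans))
print(cleaned)
```

With the code from the question, the spans could presumably be built as `[(ent.start_char, ent.end_char) for ent in text.ents if ent.label_ == 'ORG']` and passed together with the original string `doc`, since spaCy entity offsets refer to positions in the input text.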
   

