
How do I add new entity (ORG) instances in spaCy NLP

I am trying to add stock symbols to the strings recognized as ORG entities. For each symbol, I do:

nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])

I can see that this symbol gets added to the patterns:

print "Patterns:", nlp.matcher._patterns

However, any symbols that were not recognized before adding are still not recognized after adding. Apparently, these tokens already exist in the vocabulary (that is why the vocab length does not change).

What should I be doing differently? What am I missing?

Thanks

Here is my example code:

"Brief snippet to practice adding stock ticker symbols as ORG entities"

from spacy.en import English
import spacy.en
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
import os
import csv
import sys

nlp = English()  #Load everything for the English model

print "Before nlp vocab length", len(nlp.matcher.vocab)

symbol_list = [u"CHK", u"JONE", u"NE", u"DO",  u"ESV"]

txt =  u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)"""# u"""Drive double-digit rallies in Chesapeake Energy (NYSE: CHK), Noble Corporation (NYSE:NE), Diamond Offshore (NYSE:DO), Ensco (NYSE:ESV), and Jones Energy (NYSE: JONE)"""
before = nlp(txt)
for tok in before:   #Before adding entities
    print tok, tok.orth, tok.tag_, tok.ent_type_

for symbol in symbol_list:
    print "adding symbol:", symbol
    print "vocab length:", len(nlp.matcher.vocab)
    print "pattern length:", nlp.matcher.n_patterns
    nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])


print "Patterns:", nlp.matcher._patterns
print "Entities:", nlp.matcher._entities
for ent in nlp.matcher._entities:
    print ent.label

tokens = nlp(txt)

print "\n\nAfter:"
print "After nlp vocab length", len(nlp.matcher.vocab)

for tok in tokens:
    print tok, tok.orth, tok.tag_, tok.ent_type_

Here's a working example based on the docs:

import spacy

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='stock-nyse', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'NYSE'}]], on_match=merge_phrases)
matcher.add(entity_key='stock-esv', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'ESV'}]], on_match=merge_phrases)
doc = nlp(u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)""")
matcher(doc)
print(['%s|%s' % (t.orth_, t.ent_type_) for t in doc])

->

['drive|', 'double|', '-|', 'digit|', 'rallies|', 'in|', 'Chesapeake|ORG', 'Energy|ORG', '(|', 'NYSE|STOCK', ':|', 'CHK|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'NE|GPE', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'DO|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'ESV|STOCK', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'JONE|ORG', ')|']

NYSE and ESV are now marked with the STOCK entity type. Basically, on each match you should manually merge tokens and/or assign the entity types you want. There's also an acceptor function, which allows you to filter or reject matches while they are being matched.
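The flow described above (find a match, optionally reject it through an acceptor, then assign an entity type yourself) can be illustrated without spaCy. This is only a minimal, library-free sketch of the idea: the names `find_matches`, `accept_after_colon`, and `label_tokens` are hypothetical and are not part of spaCy's API, and the token list is a hand-written stand-in for a tokenized document.

```python
def find_matches(tokens, patterns, acceptor=None):
    """Return (label, start, end) for every single-token pattern hit.

    `patterns` maps an exact token text to a label. If `acceptor` is given,
    it is called for each candidate and can reject it by returning False.
    """
    matches = []
    for i, tok in enumerate(tokens):
        label = patterns.get(tok)
        if label is None:
            continue
        match = (label, i, i + 1)
        if acceptor is None or acceptor(tokens, match):
            matches.append(match)
    return matches


def accept_after_colon(tokens, match):
    """Hypothetical acceptor: keep a match only if it directly follows ':'."""
    _, start, _ = match
    return start > 0 and tokens[start - 1] == ":"


def label_tokens(tokens, matches):
    """Assign the matched label to each token covered by a match."""
    ent_types = [""] * len(tokens)
    for label, start, end in matches:
        for i in range(start, end):
            ent_types[i] = label
    return ent_types


# Hand-tokenized toy input; "DO" appears here as an ordinary word.
tokens = ["(", "NYSE", ":", "ESV", ")", ",", "DO", "it"]
patterns = {"ESV": "STOCK", "DO": "STOCK"}

matches = find_matches(tokens, patterns, acceptor=accept_after_colon)
ent_types = label_tokens(tokens, matches)
print(["%s|%s" % (t, e) for t, e in zip(tokens, ent_types)])
```

Here the acceptor keeps ESV, which follows a colon as in "(NYSE: ESV)", and rejects the stand-alone DO. A context check like this is one way to avoid tagging short tickers that are also ordinary words.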
