How do I add new entity (ORG) instances in spaCy NLP?
I am trying to add stock symbols to the strings recognized as ORG entities. For each symbol, I do:
nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])
I can see that this symbol gets added to the patterns:
print "Patterns:", nlp.matcher._patterns
but any symbols that were not recognized before adding are still not recognized after adding. Apparently, these tokens already exist in the vocabulary (that is why the vocab length does not change).
What should I be doing differently? What am I missing?
Thanks
Here is my example code:
"Brief snippet to practice adding stock ticker symbols as ORG entities"
from spacy.en import English
import spacy.en
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
import os
import csv
import sys
nlp = English() #Load everything for the English model
print "Before nlp vocab length", len(nlp.matcher.vocab)
symbol_list = [u"CHK", u"JONE", u"NE", u"DO", u"ESV"]
txt = u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)"""# u"""Drive double-digit rallies in Chesapeake Energy (NYSE: CHK), Noble Corporation (NYSE:NE), Diamond Offshore (NYSE:DO), Ensco (NYSE:ESV), and Jones Energy (NYSE: JONE)"""
before = nlp(txt)
for tok in before: #Before adding entities
    print tok, tok.orth, tok.tag_, tok.ent_type_
for symbol in symbol_list:
    print "adding symbol:", symbol
    print "vocab length:", len(nlp.matcher.vocab)
    print "pattern length:", nlp.matcher.n_patterns
    nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])
print "Patterns:", nlp.matcher._patterns
print "Entities:", nlp.matcher._entities
for ent in nlp.matcher._entities:
    print ent.label
tokens = nlp(txt)
print "\n\nAfter:"
print "After nlp vocab length", len(nlp.matcher.vocab)
for tok in tokens:
    print tok, tok.orth, tok.tag_, tok.ent_type_
Here's a working example based on the docs:
import spacy
nlp = spacy.load('en')
def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])
matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='stock-nyse', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'NYSE'}]], on_match=merge_phrases)
matcher.add(entity_key='stock-esv', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'ESV'}]], on_match=merge_phrases)
doc = nlp(u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)""")
matcher(doc)
print(['%s|%s' % (t.orth_, t.ent_type_) for t in doc])
Output:
['drive|', 'double|', '-|', 'digit|', 'rallies|', 'in|', 'Chesapeake|ORG', 'Energy|ORG', '(|', 'NYSE|STOCK', ':|', 'CHK|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'NE|GPE', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'DO|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'ESV|STOCK', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'JONE|ORG', ')|']
NYSE and ESV are now marked with the STOCK entity type. Basically, on each match you should manually merge the tokens and/or assign the entity types you want. There's also an acceptor function, which allows you to filter or reject matches while they are being matched.
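To make the "assign labels on match, optionally reject with an acceptor" idea concrete, here is a minimal plain-Python sketch. It does not use spaCy at all: tokens are plain strings, a match is a `(label, start, end)` tuple, and the helper names (`find_matches`, `label_tokens`, `follows_exchange`) and the acceptor signature are hypothetical, chosen only for illustration. The real spaCy acceptor and callback signatures may differ, so check the Matcher docs for your version.

```python
# Plain-Python illustration only; none of these helpers are spaCy APIs.

def find_matches(tokens, patterns):
    """Return (label, start, end) for every token that equals a pattern key."""
    matches = []
    for i, tok in enumerate(tokens):
        label = patterns.get(tok)  # exact, case-sensitive comparison
        if label is not None:
            matches.append((label, i, i + 1))
    return matches


def label_tokens(tokens, matches, acceptor=None):
    """Assign an entity label per token, skipping matches the acceptor rejects."""
    labels = [""] * len(tokens)
    for label, start, end in matches:
        if acceptor is not None and not acceptor(tokens, label, start, end):
            continue  # rejected: leave these tokens unlabeled
        for i in range(start, end):
            labels[i] = label
    return labels


def follows_exchange(tokens, label, start, end):
    """Hypothetical acceptor: keep a ticker only when it follows 'NYSE :'."""
    return start >= 2 and tokens[start - 2] == "NYSE" and tokens[start - 1] == ":"


tokens = ["(", "NYSE", ":", "DO", ")", "do", "not", "sell", "DO"]
patterns = {"DO": "STOCK", "ESV": "STOCK"}  # hypothetical ticker list

matches = find_matches(tokens, patterns)
labels = label_tokens(tokens, matches, acceptor=follows_exchange)
print(["%s|%s" % (t, l) for t, l in zip(tokens, labels)])
```

In this sketch the acceptor rejects the trailing bare DO, which would otherwise be labeled STOCK. That is the kind of false positive a short ticker such as DO or NE can produce when matched on spelling alone.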