
Tokenisation with Spacy - how to get left and right tokens

I am using Spacy for text tokenization and getting stuck with it:

import spacy
nlp = spacy.load("en_core_web_sm")
mytext = "This is some sentence that spacy will not appreciate"
doc = nlp(mytext)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

returns something that seems to me to say that tokenisation was successful:

This this DET DT nsubj Xxxx True False 
is be VERB VBZ ROOT xx True True 
some some DET DT det xxxx True True 
sentence sentence NOUN NN attr xxxx True False 
that that ADP IN mark xxxx True True 
spacy spacy NOUN NN nsubj xxxx True False 
will will VERB MD aux xxxx True True 
not not ADV RB neg xxx True True 
appreciate appreciate VERB VB ccomp xxxx True False

but on the other hand

[token.text for token in doc[2].lefts]

returns an empty list. Is there a bug in lefts/rights?

I'm a beginner at natural language processing and hope I am not falling into a conceptual trap. Using Spacy v2.0.4.

This is what the dependencies of that sentence look like:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is some sentence that spacy will not appreciate")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

This prints:

This nsubj is VERB []
is ROOT is VERB [This, sentence]
some det sentence NOUN []
sentence attr is VERB [some, appreciate]
that mark appreciate VERB []
spacy nsubj appreciate VERB []
will aux appreciate VERB []
not neg appreciate VERB []
appreciate relcl sentence NOUN [that, spacy, will, not]
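
You can also render this tree graphically with spaCy's built-in displaCy visualizer; a minimal sketch, reusing the doc from the snippet above (it serves an interactive view on a local port):

from spacy import displacy

# Opens an interactive dependency-tree view in the browser,
# by default at http://localhost:5000.
displacy.serve(doc, style="dep")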

So we see that doc[2] ("some") has no children in the dependency tree, whereas "is" (doc[1]) does. If we instead run...

print([token.text for token in doc[1].lefts])  
print([token.text for token in doc[1].rights])  

we get...

['This']
['sentence']

The functions you are using navigate the dependency tree, not the document, which is why you are getting empty results for some words.
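
To see non-empty results from lefts and rights, ask a token that actually heads other words in the tree. A minimal sketch, assuming the same parse as above (parses can differ between model versions):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is some sentence that spacy will not appreciate")

# "appreciate" (doc[8]) heads the relative clause, so it has children;
# lefts and rights yield the children sitting to its left and right in the text.
appreciate = doc[8]
print([t.text for t in appreciate.lefts])   # ['that', 'spacy', 'will', 'not']
print([t.text for t in appreciate.rights])  # []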

If you just want previous and following tokens, you can do something like...

for ix, token in enumerate(doc):
    if ix == 0:
        # First token: there is no previous token.
        print('Previous: %s, Current: %s, Next: %s' % ('', token, doc[ix + 1]))
    elif ix == (len(doc) - 1):
        # Last token: there is no next token.
        print('Previous: %s, Current: %s, Next: %s' % (doc[ix - 1], token, ''))
    else:
        print('Previous: %s, Current: %s, Next: %s' % (doc[ix - 1], token, doc[ix + 1]))

or...

for ix, token in enumerate(doc):
    print('Previous: %s' % doc[:ix])        # everything before the current token
    print('Current: %s' % token)
    print('Following: %s' % doc[ix + 1:])   # everything after the current token
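
As a side note, spaCy's Token.nbor() method can also fetch a neighbouring token by offset. A small sketch, guarding the document boundaries (where nbor() would raise an IndexError):

for token in doc:
    # token.i is the token's index within the doc.
    prev_tok = token.nbor(-1) if token.i > 0 else ''
    next_tok = token.nbor(1) if token.i < len(doc) - 1 else ''
    print('Previous: %s, Current: %s, Next: %s' % (prev_tok, token, next_tok))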
