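How to turn spacy doc into nested list of tokens
I am using spaCy with stanfordnlp for dependency parsing, and I get back a spaCy doc. How can I turn that doc into a nested list in which each sublist contains the child tokens of a head?
Below is a general solution to the question as asked, although including the input, the expected output, and sample code would help ensure that this answer is relevant. Explanations are provided in the comments.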
import spacy
# Load relevant language/pipeline: here, the built-in small English web-based
# model.
nlp = spacy.load("en_core_web_sm")
# Run text through pipeline to create annotated doc.
sample_text = "Colorless green ideas sleep furiously."
doc = nlp(sample_text)
# Iterate through each token (t) in the doc object, and create a nested list
# of the children of each token. Keep in mind that like many spaCy attributes,
# token.children returns a generator. To access all of its elements at once,
# you will have to convert this generator into an object of type list.
child_list = [list(t.children) for t in doc]
# Now as an exercise, print out each token and check to see if you get the
# children you expected. Normally you would want to iterate on the objects
# themselves -- we only use range() here for purposes of illustration.
for i in range(len(doc)):
    print(" token {}: {}".format(i + 1, doc[i]))
    print(" children: {}\n".format(child_list[i]))
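As the question requested, the output is a list of lists of child tokens. Note that although your terminal displays each token as if it were text, these tokens are not just text; they are spaCy token objects, each loaded with linguistic information based on the annotations in doc. The output will look like this.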
$ python example.py
token 1: Colorless
children: []
token 2: green
children: []
token 3: ideas
children: [Colorless, green]
token 4: sleep
children: [ideas, furiously, .]
token 5: furiously
children: []
token 6: .
children: []
This is exactly what we would expect.
Here is an example:
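The same child lists can also be reproduced on plain data, which may help when checking the logic or serializing the result. This is only a minimal sketch, assuming the parse has been exported as two parallel lists: the token texts and, for each token, the index of its head (in spaCy these would come from `t.text` and `t.head.i`, where the root token is its own head). The function name `children_by_index` and the hard-coded head indices are illustrative assumptions chosen to match the output above, not part of spaCy.

```python
def children_by_index(words, heads):
    """Return, for each token, the texts of its direct children in sentence order."""
    children = [[] for _ in words]
    for i, head in enumerate(heads):
        # Assumption: the root points at itself, so it is nobody's child.
        if head != i:
            children[head].append(words[i])
    return children


# Hypothetical export of the sample sentence: heads[i] is the index of token i's head.
words = ["Colorless", "green", "ideas", "sleep", "furiously", "."]
heads = [2, 2, 3, 3, 3, 3]

child_list = children_by_index(words, heads)
for word, kids in zip(words, child_list):
    print(word, kids)
```

Unlike the spaCy version, the sublists here hold plain strings rather than token objects, so the linguistic annotations are no longer attached.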
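# The session below assumes ss = Sent2Struct() and an nlp pipeline loaded as above.
class Sent2Struct(object):

    def root(self, doc):
        # Return the token whose dependency label is ROOT.
        for word in doc:
            if word.dep_ == 'ROOT':
                return word

    def lol(self, root):
        # A leaf becomes plain text; any other token becomes [text, subtrees...].
        if len(list(root.children)) == 0:
            return root.text
        childs = [self.lol(child) for child in root.children]
        return [root.text] + childs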
In [100]: print( ss.lol(ss.root(nlp('the box is on the table'))) )
['is', ['box', 'the'], ['on', ['table', 'the']]]
i.e.
is(box(the), on(table(the)))
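To see the recursion in isolation, here is a minimal sketch of the same idea on plain data. It assumes the parse is available as a list of words plus a list of head indices, with the root pointing at itself (as spaCy's `token.head` does for the root). The helper name `nest` and the hand-written head indices are hypothetical; they are picked so that the result matches the output shown above.

```python
def nest(words, heads):
    """Build [head, subtree, ...] recursively; a token without children stays a string."""
    children = {i: [] for i in range(len(words))}
    root = None
    for i, head in enumerate(heads):
        if head == i:
            root = i  # assumption: the root is its own head
        else:
            children[head].append(i)
    if root is None:
        raise ValueError("no root found: expected one token whose head is itself")

    def build(i):
        if not children[i]:
            return words[i]
        return [words[i]] + [build(c) for c in children[i]]

    return build(root)


# Hypothetical head indices for "the box is on the table".
words = ["the", "box", "is", "on", "the", "table"]
heads = [1, 2, 2, 2, 5, 3]
tree = nest(words, heads)
print(tree)
```

Note that leaves come back as bare strings while every other node is a list, exactly as in `lol`; if a uniform shape is easier to process downstream, returning `[words[i]]` for leaves is a small change.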