
How to turn spacy doc into nested list of tokens

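I used spacy and stanfordnlp for dependency parsing, and I ended up with a spaCy doc. How can I turn that doc into a nested list, where each sublist contains the child tokens of a head?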

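Below is a general solution to what you asked, although including your input, expected output, and sample code would help make sure this answer is relevant. Explanations are given in the comments.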

import spacy

# Load relevant language/pipeline: here, the built-in small English web-based
# model.
nlp = spacy.load("en_core_web_sm")

# Run text through pipeline to create annotated doc.
sample_text = "Colorless green ideas sleep furiously."
doc = nlp(sample_text)

# Iterate through each token (t) in the doc object, and create a nested list
# of the children of each token. Keep in mind that like many spaCy attributes,
# token.children returns a generator. To access all of its elements at once,
# you will have to convert this generator into an object of type list.
child_list = [list(t.children) for t in doc]

# Now as an exercise, print out each token and check to see if you get the
# children you expected. Normally you would want to iterate on the objects 
# themselves -- we only use range() here for purposes of illustration.
for i in range(len(doc)):
    print("  token {}: {}".format(i + 1, doc[i]))
    print("    children: {}\n".format(child_list[i]))

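As the question requested, the output is a list of lists of child tokens. Note that although your terminal displays each token as if it were plain text, these tokens are not just text; they are spaCy Token objects, each carrying linguistic information from the annotations in the doc. The output will look like this.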

$ python example.py
  token 1: Colorless
    children: []
  token 2: green
    children: []
  token 3: ideas
    children: [Colorless, green]
  token 4: sleep
    children: [ideas, furiously, .]
  token 5: furiously
    children: []
  token 6: .
    children: []

This is exactly what we expect:

[Figure: spaCy dependency parse of "Colorless green ideas sleep furiously."]
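To make the grouping explicit, here is a minimal sketch that rebuilds the same nested list from nothing but the words and each word's head index. It does not call spaCy at all: `children_by_head` is a hypothetical helper, and the `heads` list is written by hand to match the parse shown above. With a real doc you could presumably fill the two lists with `[t.text for t in doc]` and `[t.head.i for t in doc]`, since the root token in spaCy is its own head.

```python
# Hypothetical helper, not part of spaCy: group each token's children by head.
def children_by_head(words, heads):
    """Return one sublist per token holding the words whose head is that token."""
    nested = [[] for _ in words]
    for i, head in enumerate(heads):
        if head != i:  # the root is its own head, so it is nobody's child
            nested[head].append(words[i])
    return nested


# Hand-written head indices (an assumption, chosen to match the parse above).
# With a real doc: words = [t.text for t in doc]; heads = [t.head.i for t in doc]
words = ["Colorless", "green", "ideas", "sleep", "furiously", "."]
heads = [2, 2, 3, 3, 3, 3]

for word, kids in zip(words, children_by_head(words, heads)):
    print(word, kids)
```

Unlike the spaCy version, the sublists here hold plain strings rather than Token objects, which may be what you want if the list is going to be serialized or compared.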

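Here is an example (in the IPython session that follows the class, ss is an instance of Sent2Struct and nlp is a loaded spaCy pipeline):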

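class Sent2Struct(object):

    def root(self, doc):
        # Return the first token labelled ROOT (None if the doc has no root).
        for word in doc:
            if word.dep_ == 'ROOT':
                return word

    def lol(self, root):
        # "List of lists": a leaf becomes its text; any other token becomes
        # [its text, subtree of first child, subtree of second child, ...].
        children = list(root.children)
        if len(children) == 0:
            return root.text
        return [root.text] + [self.lol(child) for child in children]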



   In [100]: print(ss.lol(ss.root(nlp('the box is on the table'))))
   ['is', ['box', 'the'], ['on', ['table', 'the']]]

i.e.

   is(box(the), on(table(the)))
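The same recursion can be tried without loading a model. The sketch below is only an illustration under stated assumptions: `nested_from_heads` and `to_bracket` are hypothetical names, and the head indices are written by hand to reproduce the parse of 'the box is on the table' shown above. It builds the nested list from words and head indices, then renders it in the bracket notation.

```python
def nested_from_heads(words, heads):
    """Build [head, subtree, subtree, ...] recursively; a leaf is a plain string."""
    children = [[] for _ in words]
    root = None
    for i, head in enumerate(heads):
        if head == i:
            root = i  # assumes exactly one root, which is its own head
        else:
            children[head].append(i)
    if root is None:
        raise ValueError("no root found: some token must be its own head")

    def build(i):
        if not children[i]:
            return words[i]
        return [words[i]] + [build(child) for child in children[i]]

    return build(root)


def to_bracket(node):
    """Render a nested list as head(child, child, ...)."""
    if isinstance(node, str):
        return node
    head, *rest = node
    return "{}({})".format(head, ", ".join(to_bracket(r) for r in rest))


# Hand-written head indices (an assumption, matching the example output above).
words = ["the", "box", "is", "on", "the", "table"]
heads = [1, 2, 2, 2, 5, 3]

tree = nested_from_heads(words, heads)
print(tree)
print(to_bracket(tree))
```

Keeping leaves as bare strings mirrors the behaviour of lol above; wrapping every leaf in a one-element list instead would make the structure more uniform at the cost of extra brackets.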
