spaCy (Python 3.10) token.lefts method incorrectly returns an empty list

Question

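Dear NLP programmers,

For some time now I have been running into problems with spaCy's (token).lefts and (token).rights methods. When incorporated into my code, they tend to return empty lists (seemingly at random). To illustrate the problem, I am pasting below a fairly simple piece of Python code that extracts some financial information from the supplied text (the code is for illustration and testing purposes only).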

import spacy
from spacy.tokens import Token  # not used below; the loop variable `token` would shadow a lowercase import
nlp = spacy.load('en_core_web_sm')

def number_exctractor(doc):
    # Finds number in the provided text (if any) and extracts it together with the head_rights.
    # Should return: phrase, last_element.
    phrase = ''
    for token in doc:
        if token.pos_ == 'NUM':
            while True:
                phrase += token.text
                token = token.head
                if token not in list(token.head.lefts):
                    phrase += ' ' + token.text + '.'
                    return phrase, token
    return None, None
                    
def utility_builder(doc, phrase, token):
    # Walks up the head chain starting from the head of the last_element,
    # prepending each head's text to the phrase.
    # Stops at the first VERB, or at the ROOT if no VERB is found.
    # Should return: phrase, last_element.
    while True:
        token = doc[token.i].head
        phrase = token.text + ' ' + phrase
        if token.pos_ == 'VERB' or token.head == token:
            return phrase, token

def nsubj_finder(doc, phrase, token):
    # Iterates over the left children of the last_element (the verb).
    # When an nsubj is found, prepends the nsubj's own left children and the nsubj to the phrase.
    # Should return: phrase (implicitly None if no left child is an nsubj).
    token = doc[token.i]
    for child in token.lefts:
        if child.dep_ == "nsubj":
            phrase = ' '.join([t.text for t in child.lefts]) + ' ' + child.text + ' ' + phrase
            return phrase

def document_searcher(doc):
    sentences = []
    for sent in doc.sents:
        phrase, last_element = number_exctractor(sent)
        if phrase is not None:
            phrase, last_element = utility_builder(doc, phrase, last_element)
            phrase = nsubj_finder(doc, phrase, last_element)
            sentences.append(phrase)
    return sentences

doc = nlp('''The company, whose profits reached a record high this year, largely attributed
to changes in management, earned a total revenue of $4.26 million.''')
p = document_searcher(doc)
print(p)

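The problem is that the loop over token.lefts in nsubj_finder() never runs, because token.lefts returns an empty list. For comparison, I tried the same attribute in Python IDLE. Sometimes it returns an empty list and sometimes a non-empty one. Do you know what could be causing this behaviour?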
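One detail of the code above is worth isolating: nsubj_finder() returns None implicitly when none of the verb's left children is labelled nsubj, so an empty token.lefts surfaces as None in the printed list. The sketch below reproduces that control flow without a model. FakeToken is a hypothetical stand-in class (not part of spaCy) that exposes only text, dep_ and lefts, and find_subject_phrase is an illustrative name; unlike the original function, it returns the phrase unchanged when no subject is found.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing only the attributes used below."""
    text: str
    dep_: str = ""
    lefts: List["FakeToken"] = field(default_factory=list)


def find_subject_phrase(verb, phrase):
    """Prepend the nsubj of `verb` (plus its left children) to `phrase`.

    Returns `phrase` unchanged if no left child of `verb` is an nsubj.
    """
    for child in verb.lefts:
        if child.dep_ == "nsubj":
            modifiers = [t.text for t in child.lefts]
            return " ".join(modifiers + [child.text, phrase])
    return phrase


# Case 1: the verb has a subject among its left children.
the = FakeToken("The", "det")
company = FakeToken("company", "nsubj", [the])
with_subject = FakeToken("earned", "ROOT", [company])

# Case 2: the verb has no left children (dependency label chosen for illustration).
without_subject = FakeToken("earned", "conj")

print(find_subject_phrase(with_subject, "earned a total revenue"))
print(find_subject_phrase(without_subject, "earned a total revenue"))
```

With real spaCy tokens the same function should work unchanged, since it only reads text, dep_ and lefts.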

Answer 1

for i in doc:
  print(list(i.lefts))

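With spaCy 3.1.2 this returns the output below, so you may need to try a different model, such as en_core_web_lg, or a different version, because these models sometimes fail and give odd results: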

[]
[The]
[]
[]
[whose]
[profits]
[]
[]
[a, record]
[]
[this]
[]
[]
[company, largely]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[a, total]
[]
[]
[]
[$, 4.26]
[]

And:

for i in doc:
  print(list(i.rights))

Returns:

[]
[,, reached]
[]
[]
[]
[high, year, ,]
[]
[]
[]
[]
[]
[]
[]
[
, to, ,, earned, .]
[]
[changes]
[in]
[management]
[]
[]
[revenue]
[]
[]
[of]
[million]
[]
[]
[]
[]
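Read together, the two listings above imply a head for every token, which makes it possible to see why the loop in nsubj_finder() finds nothing. The sketch below is a plain-Python reconstruction and does not call spaCy. The words and heads lists are inferred from the printed output (29 tokens, including the newline token) and are an assumption rather than something dumped from the parser. In this parse, "company" is attached to "attributed", so "earned" has no left children at all.

```python
# Tokens in document order, as implied by the 29 lines in each listing above.
words = ["The", "company", ",", "whose", "profits", "reached", "a", "record",
         "high", "this", "year", ",", "largely", "attributed", "\n", "to",
         "changes", "in", "management", ",", "earned", "a", "total", "revenue",
         "of", "$", "4.26", "million", "."]

# Head index of each token, read off the lefts/rights listings.
# The root ("attributed", index 13) points at itself.
heads = [1, 13, 1, 4, 5, 1, 8, 8, 5, 10, 5, 5, 13, 13, 13, 13,
         15, 16, 17, 13, 13, 23, 23, 20, 23, 27, 27, 24, 13]


def lefts(i, words, heads):
    """Children of token i that appear before it in the text."""
    return [words[j] for j in range(i) if heads[j] == i]


def rights(i, words, heads):
    """Children of token i that appear after it in the text."""
    return [words[j] for j in range(i + 1, len(words)) if heads[j] == i]


earned = words.index("earned")
attributed = words.index("attributed")

print("earned lefts:     ", lefts(earned, words, heads))
print("earned rights:    ", rights(earned, words, heads))
print("attributed lefts: ", lefts(attributed, words, heads))
```

If this reading of the listings is right, the empty list comes from how this particular sentence was parsed rather than from random behaviour of the attribute.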

Answer 2

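OK, thanks to @Cardstdani I have figured it out. Both token.lefts and token.rights rely on the parser. As far as I recall (you may want to double-check the documentation to confirm this), at least en_core_web_lg should include a parser, but I ran into the same problem even with that model.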
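The dependence on the parser can be checked directly. The sketch below uses spacy.blank("en"), which builds a tokenizer-only pipeline and therefore needs no model download; the example sentence is made up for illustration. With no parser in the pipeline, no dependency annotation is set, and every token should report no left children. The same two checks (nlp.pipe_names and doc.has_annotation("DEP")) can be run on any loaded model to confirm whether it includes a parser.

```python
import spacy

# Tokenizer-only pipeline: no tagger, parser or NER components.
nlp = spacy.blank("en")
doc = nlp("The company earned a total revenue of $4.26 million.")

# Components present in the pipeline.
print(nlp.pipe_names)

# Whether any dependency labels were assigned to this Doc.
print(doc.has_annotation("DEP"))

# Left children per token; without a parse there is nothing to report.
print([list(tok.lefts) for tok in doc])
```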

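To solve the problem I had to install en_core_web_trf, which is the recommended package if you need a more accurate model (note, though, that it is considerably larger, so it may not be the best choice if your main goal is to deploy, for example, a lightweight application).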

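To install en_core_web_trf I had to downgrade to Python 3.9 (I had been using the newly released 3.10). This may not be the case in your environment, but in mine Python 3.10 was causing some package dependency problems (possibly because I had previously installed en_core_web_lg and en_core_web_sm, although that should not matter).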
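When several interpreters and models are involved, it can help to confirm which pipeline packages the current interpreter actually sees before changing anything. The sketch below relies on the fact that spaCy's trained pipelines are installed as ordinary importable Python packages; installed_packages is an illustrative helper name, not a spaCy function, and it uses only the standard library.

```python
import sys
from importlib.util import find_spec


def installed_packages(candidates):
    """Return the names in `candidates` that this interpreter can import."""
    return [name for name in candidates if find_spec(name) is not None]


models = ["en_core_web_sm", "en_core_web_lg", "en_core_web_trf"]

# Report the interpreter version alongside the models it can find.
print("Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Installed pipelines:", installed_packages(models))
```

Running this under each interpreter (for example 3.9 and 3.10) shows which models are installed in which environment.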

Solution 1
0 2021-11-02 18:05:29

Solution 2
0 2021-11-03 11:37:54
