spaCy PhraseMatcher: TypeError when trying to remove matched phrases

Question

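I am trying to use the spaCy PhraseMatcher ( https://spacy.io/usage/rule-based-matching#phrasematcher ) to efficiently clean up text that comes from automatic speech recognition software. The data is very dirty and does not distinguish between speakers, so I am trying to remove the phrases that repeat across all data samples. With the rule-based PhraseMatcher I can find the target text in my example string, but when I try to replace the matches with whitespace I get the following type error: TypeError: replace() argument 1 must be str, not spacy.tokens.token.Token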

My code is as follows:

# Import the required libraries:
import spacy
from spacy.matcher import PhraseMatcher

# Declare string from text extracted from a dataframe. Please note that there are many errors in the ASR, including words recognized incorrectly such as "mercado", which is a mis-translated utterance from the IVR.

conv_str = "Welcome to companyx, to continue in English, please press one but I contin into mercado. Hello, I am V companyx, virtual assistant to best serve you during our conversation. Please provide your responses after I finished speaking in a few words please tell me what you're calling about. You can say something like I want to change my account information"

# Create the matcher (`nlp` is assumed to be a pipeline loaded earlier; it is not defined in this snippet)
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Declare a list of strings to search for in another string

terms = ["Welcome to CompanyX", "to continue in English, please press one", "virtual assistant", "In a few words please tell me what you're calling about", "CompanyX"]
# the stack overflow interface is incorrectly coloring some of the term strings, but it works in python

# create patterns variable
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp(conv_str)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end] # span is a list
    terms_not_needed = list(span)
    for item in terms_not_needed:
        conv_str.replace(item, ' ')

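As mentioned, I get the TypeError printed above. I understand that the str.replace argument needs to be a string, but I thought that by turning span into a list, I could iterate over that terms_not_needed list and match the individual strings. Any guidance would be very helpful.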

Answer 1

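There are a couple of problems with your approach here. One is that, because of the way replace works, there is no reason to use the PhraseMatcher if you are going to use replace: it already replaces every instance of the string.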
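
To illustrate the point about replace, here is a minimal plain-Python sketch that removes the terms without spaCy at all. It is only an illustration, not the approach recommended in this answer: the sample string is shortened from the question, remove_terms is a hypothetical helper name, and using re.sub with re.IGNORECASE to mimic attr="LOWER" is an assumption about what case handling is wanted.

```python
import re

# Shortened sample and term list taken from the question.
conv_str = ("Welcome to companyx, to continue in English, please press one. "
            "Hello, I am V companyx, virtual assistant.")
terms = ["Welcome to CompanyX", "virtual assistant", "CompanyX"]

# str.replace returns a new string and never changes conv_str in place,
# so the result has to be assigned to something.
replaced_once = conv_str.replace("companyx", " ")


def remove_terms(text, terms):
    """Hypothetical helper: remove every term from text, ignoring case."""
    # Longest terms first, so "Welcome to CompanyX" goes before "CompanyX".
    for term in sorted(terms, key=len, reverse=True):
        text = re.sub(re.escape(term), " ", text, flags=re.IGNORECASE)
    return text


cleaned = remove_terms(conv_str, terms)
print(cleaned)
```

Note that this works on raw characters rather than tokens, so a term would also be removed when it appears inside a longer word. That is one reason to prefer the token-based approach if word boundaries matter.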

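What I would do instead is use an on_match callback to set a custom attribute (for example token._.ignore ) to True for everything the matcher finds. Then, to get the tokens you are interested in, you just iterate over the Doc and keep every token where that value is not True.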

Here is a modified version of your code that does this:

# Import the required libraries:
import spacy
from spacy.tokens import Token
from spacy.matcher import PhraseMatcher

Token.set_extension("ignore", default=False)

# Declare string from text extracted from a dataframe. Please note that there are many errors in the ASR, including words recognized incorrectly such as "mercado", which is a mis-translated utterance from the IVR.

conv_str = "Welcome to companyx, to continue in English, please press one but I contin into mercado. Hello, I am V companyx, virtual assistant to best serve you during our conversation. Please provide your responses after I finished speaking in a few words please tell me what you're calling about. You can say something like I want to change my account information"

nlp = spacy.blank("en")
# call the matcher
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")


def set_ignore(matcher, doc, i, matches):
    # on_match callback: called once per match, with i as the index into matches
    _, start, end = matches[i]
    for tok in doc[start:end]:
        tok._.ignore = True

# Declare a list of strings to search for in another string

terms = ["Welcome to CompanyX", "to continue in English, please press one", "virtual assistant", "In a few words please tell me what you're calling about", "CompanyX"]
# the stack overflow interface is incorrectly coloring some of the term strings, but it works in python

# create patterns variable
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns, on_match=set_ignore)

doc = nlp(conv_str)
# this will run the callback
matcher(doc)

toks = [tok.text + tok.whitespace_ for tok in doc if not tok._.ignore]
print("".join(toks))
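
As a side note on the TypeError itself: iterating over a Span yields Token objects, not strings, and str.replace returns a new string rather than changing conv_str. The sketch below shows a variant that stays closer to the loop in the question by using each span's character offsets. It is a sketch under assumptions, not part of the answer above: the sample string and term list are shortened, and the drop set is a name introduced here for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer-only pipeline; no model download needed
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["Welcome to CompanyX", "virtual assistant", "CompanyX"]
matcher.add("TerminologyList", [nlp.make_doc(term) for term in terms])

# Shortened sample from the question.
conv_str = "Welcome to companyx. Hello, I am V companyx, virtual assistant to best serve you."
doc = nlp(conv_str)

drop = set()  # character positions covered by any match
for match_id, start, end in matcher(doc):
    span = doc[start:end]  # a Span; iterating over it yields Token objects
    assert isinstance(span.text, str)  # span.text (or token.text) is the plain string
    drop.update(range(span.start_char, span.end_char))

# Strings are immutable, so build a new string from the characters to keep.
cleaned = "".join(ch for i, ch in enumerate(conv_str) if i not in drop)
print(cleaned)
```

Collecting offsets first makes overlapping matches (such as "CompanyX" inside "Welcome to CompanyX") harmless. Unlike the token-flag version, this one leaves the whitespace around removed phrases in place, so some extra whitespace cleanup may be needed afterwards.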

Answer 1 (0) 2022-09-05 03:47:13
