
spaCy phrase matcher: TypeError when trying to remove matched phrases

I am trying to use the spaCy phrase matcher ( https://spacy.io/usage/rule-based-matching#phrasematcher ) to efficiently clean up text that comes from automatic speech recognition software. The data is very dirty and the speakers are not distinguished, so I am trying to remove phrases that are repeated across all of the data samples. Using the rule-based phrase matcher I am able to find the target text in a sample string, but when I try to replace the matches with whitespace I get the following error: TypeError: replace() argument 1 must be str, not spacy.tokens.token.Token

My code is as follows:

# Import the required libraries:
import spacy
from spacy.matcher import PhraseMatcher

# Declare a string of text extracted from a dataframe. Note that the ASR output contains many errors, including incorrectly recognized words such as "mercado", which is a mis-translated utterance from the IVR.

conv_str = "Welcome to companyx, to continue in English, please press one but I contin into mercado. Hello, I am V companyx, virtual assistant to best serve you during our conversation. Please provide your responses after I finished speaking in a few words please tell me what you're calling about. You can say something like I want to change my account information"

# A pipeline must be defined before building the matcher; a blank English pipeline is assumed here
nlp = spacy.blank("en")

# Create the matcher, matching on the lowercased form of each token
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Declare a list of strings to search for in another string

terms = ["Welcome to CompanyX", "to continue in English, please press one", "virtual assistant", "In a few words please tell me what you're calling about", "CompanyX"]

# create patterns variable
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp(conv_str)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # a Span object, not a list
    terms_not_needed = list(span)  # a list of Token objects, not strings
    for item in terms_not_needed:
        conv_str.replace(item, ' ')  # raises the TypeError: item is a Token, not a str

As I mentioned, I get the TypeError printed above. I know that the str.replace argument needs to be a string, but I thought that by turning the span into a list I could iterate over the terms_not_needed list and match the individual strings. Any guidance would be very helpful.
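For reference (an illustrative check, not in the original post), the type mismatch can be seen directly: list(span) yields Token objects, and only a token's .text attribute is a plain string:

# Illustrative only: inspect the types involved in the failing loop
span = doc[0:3]
print(type(span))                # <class 'spacy.tokens.span.Span'>
print(type(list(span)[0]))       # <class 'spacy.tokens.token.Token'>
print(type(list(span)[0].text))  # <class 'str'> -- the type str.replace expects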

There are a couple of issues with your approach here. One is that, because of how replace works, there is no reason to use the PhraseMatcher if you are going to use it - replace already replaces every instance in the string.
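To illustrate that point (a minimal sketch, not part of the original answer), plain str.replace over the term list already removes every occurrence; note that, unlike the attr="LOWER" matcher, it is case-sensitive, and that replace returns a new string which must be assigned back:

# Minimal sketch: plain string replacement, no matcher involved.
# Case-sensitive, unlike PhraseMatcher(..., attr="LOWER").
cleaned = conv_str
for term in terms:
    cleaned = cleaned.replace(term, " ")  # replace() returns a new string
print(cleaned)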

What I would do is use an on_match callback to set a custom attribute (for example token._.ignore) to True on anything the matcher finds. Then, to get the tokens you are interested in, you just iterate over the Doc and take every token where that value is not True.

Here is a modified version of your code that does this:

# Import the required libraries:
import spacy
from spacy.tokens import Token
from spacy.matcher import PhraseMatcher

Token.set_extension("ignore", default=False)

# Declare a string of text extracted from a dataframe. Note that the ASR output contains many errors, including incorrectly recognized words such as "mercado", which is a mis-translated utterance from the IVR.

conv_str = "Welcome to companyx, to continue in English, please press one but I contin into mercado. Hello, I am V companyx, virtual assistant to best serve you during our conversation. Please provide your responses after I finished speaking in a few words please tell me what you're calling about. You can say something like I want to change my account information"

nlp = spacy.blank("en")
# Create the matcher, matching on the lowercased form of each token
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")


# Callback: mark every token inside a match so it can be skipped later
def set_ignore(matcher, doc, id, matches):
    for _, start, end in matches:
        for tok in doc[start:end]:
            tok._.ignore = True

# Declare a list of strings to search for in another string

terms = ["Welcome to CompanyX", "to continue in English, please press one", "virtual assistant", "In a few words please tell me what you're calling about", "CompanyX"]

# create patterns variable
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns, on_match=set_ignore)

doc = nlp(conv_str)
# this will run the callback
matcher(doc)

toks = [tok.text + tok.whitespace_ for tok in doc if not tok._.ignore]
print("".join(toks))
