spaCy phrase matcher: TypeError when trying to remove matched phrases
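I am trying to use the spaCy PhraseMatcher (https://spacy.io/usage/rule-based-matching#phrasematcher) to efficiently clean up text that comes from automatic speech recognition (ASR) software. The data is very dirty and does not distinguish between speakers, so I am trying to remove phrases that are repeated across all of the data samples. Using the rule-based PhraseMatcher, I am able to find the target text in my sample string, but when I try to replace the matches with a space I get the following type error: TypeError: replace() argument 1 must be str, not spacy.tokens.token.Token
My code is below: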
# Import the required libraries:
import spacy
from spacy.matcher import PhraseMatcher
# Declare string from text extracted from a dataframe. Please note that there are many errors in the ASR output, including incorrectly recognized words such as "mercado", which is a mis-translated utterance from the IVR.
conv_str = "Welcome to companyx, to continue in English, please press one but I contin into mercado. Hello, I am V companyx, virtual assistant to best serve you during our conversation. Please provide your responses after I finished speaking in a few words please tell me what you're calling about. You can say something like I want to change my account information"
# Create the matcher (this assumes an `nlp` pipeline object was created earlier, e.g. nlp = spacy.blank("en"))
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
# Declare a list of strings to search for in another string
terms = ["Welcome to CompanyX", "to continue in English, please press one", "virtual assistant", "In a few words please tell me what you're calling about", "CompanyX"]
# the stack overflow interface is incorrectly coloring some of the term strings, but it works in python
# create patterns variable
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)
doc = nlp(conv_str)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end] # span is a list
    terms_not_needed = list(span)
    for item in terms_not_needed:
        conv_str.replace(item, ' ')
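As I mentioned, I get the TypeError printed above. I understand that the str.replace argument needs to be a string, but I thought that by declaring the span as a list I could iterate through the terms_not_needed list and match on individual strings. Any guidance would be very helpful.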
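There are a couple of issues with your approach here. One is that, because of the way replace works, there is no reason to use the PhraseMatcher if you are going to use it: replace already replaces every instance of a string.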
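To make the point about replace concrete: the TypeError occurs because iterating over a Span yields Token objects, while str.replace only accepts strings. In addition, str.replace returns a new string rather than modifying conv_str in place, so its result would have to be assigned. Below is a minimal sketch of plain string removal without spaCy. The helper name remove_phrases is hypothetical, and the sketch swaps in re with re.IGNORECASE instead of str.replace, since str.replace is case-sensitive and the terms list mixes cases ("CompanyX" vs. "companyx").

```python
import re


def remove_phrases(text, phrases):
    """Remove every case-insensitive occurrence of each phrase from text.

    Hypothetical helper for illustration; not part of spaCy.
    """
    # Try longer phrases first so "Welcome to CompanyX" wins over "CompanyX".
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)
    # Strings are immutable, so the substituted result must be captured.
    cleaned = pattern.sub(" ", text)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", cleaned).strip()


sample = "Welcome to companyx, to continue in English, please press one but I contin into mercado."
terms = ["Welcome to CompanyX", "to continue in English, please press one", "CompanyX"]
result = remove_phrases(sample, terms)
print(result)
```

Note that this operates on raw characters rather than tokens, so unlike the PhraseMatcher it can also match inside longer words.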
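What I would do instead is use an on_match callback to set a custom attribute, such as token._.ignore, to True for anything the matcher finds. Then, to get the tokens you are interested in, you can just iterate over the Doc and take every token where that value is not True.
Here is a modified version of your code that does this: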
# Import the required libraries:
import spacy
from spacy.tokens import Token
from spacy.matcher import PhraseMatcher
Token.set_extension("ignore", default=False)
# Declare string from text extracted from a dataframe. Please note that there are many errors in the ASR output, including incorrectly recognized words such as "mercado", which is a mis-translated utterance from the IVR.
conv_str = "Welcome to companyx, to continue in English, please press one but I contin into mercado. Hello, I am V companyx, virtual assistant to best serve you during our conversation. Please provide your responses after I finished speaking in a few words please tell me what you're calling about. You can say something like I want to change my account information"
nlp = spacy.blank("en")
# call the matcher
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
# on_match callback: flag every token that falls inside a matched phrase
def set_ignore(matcher, doc, id, matches):
    for _, start, end in matches:
        for tok in doc[start:end]:
            tok._.ignore = True
# Declare a list of strings to search for in another string
terms = ["Welcome to CompanyX", "to continue in English, please press one", "virtual assistant", "In a few words please tell me what you're calling about", "CompanyX"]
# the stack overflow interface is incorrectly coloring some of the term strings, but it works in python
# create patterns variable
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns, on_match=set_ignore)
doc = nlp(conv_str)
# this will run the callback
matcher(doc)
toks = [tok.text + tok.whitespace_ for tok in doc if not tok._.ignore]
print("".join(toks))
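As a possible alternative to the custom attribute (not the approach shown above), the matches can also be cut out of the original string by character offsets. The sketch below assumes spaCy v3 or later, where the matcher accepts as_spans=True, and uses spacy.util.filter_spans to drop overlapping matches in favor of the longest one. The function name strip_matches and the shortened term list and sample sentence are hypothetical, chosen only for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
example_terms = ["Welcome to CompanyX", "virtual assistant", "CompanyX"]
matcher.add("TerminologyList", [nlp.make_doc(t) for t in example_terms])


def strip_matches(text):
    """Return text with every matched phrase cut out (hypothetical helper)."""
    doc = nlp(text)
    # as_spans=True (spaCy v3+) yields Span objects instead of (id, start, end) tuples.
    # filter_spans keeps the longest span among overlaps and returns them in document order.
    spans = filter_spans(matcher(doc, as_spans=True))
    pieces, cursor = [], 0
    for span in spans:
        # Keep the text between the previous match and this one.
        pieces.append(doc.text[cursor:span.start_char])
        cursor = span.end_char
    pieces.append(doc.text[cursor:])
    return "".join(pieces)


cleaned = strip_matches(
    "Welcome to companyx. Hello, I am V companyx, virtual assistant to best serve you."
)
print(cleaned)
```

This keeps the untouched parts of the string exactly as they were, including punctuation and spacing, so some whitespace cleanup afterwards may still be needed.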