spaCy's regex is different to Python's regex
I have the following text:
text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
When I use an ordinary regular expression, I get the following:
import re
regex = r'\d{1}[a|p]m'
re.findall(regex, text)
# Returned:
['5am', '6am', '9pm', '6am', '6am', '9pm']
However, when I use the same regex in spaCy, I get nothing back:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')
matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': {'REGEX': r'\d{1}[a|p]m'}}]
matcher.add('TIME', None, pattern)
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.sent.text)
Does this mean we can't use ordinary regular expressions with spaCy? If so, do you know where I can learn spaCy's special regex syntax? Thanks.
Keep in mind that spaCy's tokenizer splits the number from the letters here, so "1pm" becomes two tokens. See this test:
doc = nlp("1pm")
print([token.text for token in doc]) # => ['1', 'pm']
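According to the spaCy documentation:
If spaCy's tokenization doesn't match the tokens defined in a pattern, the pattern is not going to produce any results.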
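To see why the original pattern finds nothing, it can help to mimic what the Matcher does: a `REGEX` inside a token pattern is tested against one token's text at a time, never against the whole string. The sketch below uses only Python's `re` module. The token list is hand-written (an assumption that mirrors the `['1', 'pm']` split shown above, not real spaCy output), and `str.isdigit()` is a rough stand-in for `LIKE_NUM`, which is broader in spaCy.

```python
import re

# Hand-written tokens, assumed to mirror how spaCy splits "5am" and "9pm".
tokens = ["to", "5", "am", "30", "%", ".", "9", "pm", "Saturday"]

whole_string_regex = re.compile(r"\d[ap]m")  # digit and am/pm expected in one token
suffix_regex = re.compile(r"^[ap]m$")        # am/pm expected as a token of its own

# Token by token, the first regex never sees a digit and "am" together.
per_token_hits = [t for t in tokens if whole_string_regex.search(t)]

# Two-token check: a numeric token followed by an am/pm token.
pair_hits = [
    tokens[i] + tokens[i + 1]
    for i in range(len(tokens) - 1)
    if tokens[i].isdigit() and suffix_regex.search(tokens[i + 1].lower())
]

print(per_token_hits)  # []
print(pair_hits)       # ['5am', '9pm']
```

The second check has the same shape as a two-token Matcher pattern: one condition per token, applied in sequence.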
You need to define your own entity with rule-based matching, using one pattern entry per token:
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
Then add it to the matcher:
matcher.add('TIME', None, pattern)
Get the matches:
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(span.text)
Full demo:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
# spaCy v2 signature; in spaCy v3 this call is matcher.add('TIME', [pattern])
matcher.add('TIME', None, pattern)
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])
#=> [5am, 6am, 9pm, 6am, 6am, 9pm]