
spaCy's regex is different to Python's regex

I have the following text:

text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'

When I use a normal regex, I get the following:

import re
regex = r'\d{1}[a|p]m'  # raw string so \d reaches the regex engine unchanged
re.findall(regex, text)

# Returned:
['5am', '6am', '9pm', '6am', '6am', '9pm']

However, when I use the same regex in spaCy, I get nothing back:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')

matcher = Matcher(nlp.vocab)
# A single-token pattern: the regex is tested against each token's text
pattern = [{'TEXT': {'REGEX': r'\d{1}[a|p]m'}}]
matcher.add('TIME', None, pattern)

doc = nlp(text)
matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.sent.text)

Does this mean we can't use normal regex with spaCy? If so, do you know where I can learn spaCy's special regex syntax? Thanks.

Keep in mind that the tokenizer splits the digit from the letters here. See this test:

doc = nlp("1pm")
print([token.text for token in doc]) # => ['1', 'pm']
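To make the consequence concrete, here is a minimal sketch using only the standard `re` module. The token list is hard-coded by hand to mirror the split shown above (it is an assumption, not real spaCy output), and the point is simply that a regex tested against one token at a time never sees "5am" as a single string.

```python
import re

# Hypothetical tokens, written by hand to mirror the digit/suffix split
# shown in the test above; this is not produced by spaCy.
tokens = ["midnight", "to", "5", "am", "30", "%"]

whole_text_regex = re.compile(r"\d[ap]m")

# Token-level matching: the regex is tested against each token separately.
per_token_hits = [t for t in tokens if whole_text_regex.search(t)]

# String-level matching: the regex sees the characters joined together.
full_text_hits = whole_text_regex.findall("midnight to 5am 30%")

print(per_token_hits)  # []
print(full_text_hits)  # ['5am']
```

The regex syntax is the same in both cases; only the unit it is applied to differs.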

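According to the spaCy documentation:

If spaCy's tokenization doesn't match the tokens defined in a pattern, the pattern is not going to produce any results.

So the regex syntax itself is ordinary Python regex; the pattern has to describe two tokens instead of one. Use rule-based matching to define your own entity: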

pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]

Then add it to the matcher:

# spaCy v2 signature; in spaCy v3 this is matcher.add('TIME', [pattern])
matcher.add('TIME', None, pattern)

Get the matches:

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(span.text)
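As a rough illustration of what the two-token pattern asks for, the sketch below slides over a hand-written token list with plain Python. It is a simplified stand-in and not how spaCy's `Matcher` is implemented: `find_times` is a hypothetical helper, the tokens are assumed, and `str.isdigit()` only approximates `LIKE_NUM`, which also accepts number words such as "ten".

```python
import re

# Hypothetical, hard-coded tokens standing in for spaCy's tokenization.
tokens = ["9", "pm", "Saturday", "to", "Midnight", "Saturday", "25", "%"]

# Same regex as the second pattern entry, applied to a lowercased token.
suffix = re.compile(r"^[ap]m$")

def find_times(tokens):
    """Return (start, end) index pairs where a digit token precedes am/pm."""
    hits = []
    for i in range(len(tokens) - 1):
        # First pattern entry: a number-like token (approximated by isdigit).
        # Second pattern entry: the next token, lowercased, is "am" or "pm".
        if tokens[i].isdigit() and suffix.match(tokens[i + 1].lower()):
            hits.append((i, i + 2))
    return hits

spans = find_times(tokens)
print(["".join(tokens[s:e]) for s, e in spans])  # ['9pm']
```

Note that "25" followed by "%" is skipped because the second token fails the `^[ap]m$` check, which is the same reason the percentages in the original text are not matched.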

Full demo:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
doc = nlp(text)

matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
# spaCy v2 signature; in spaCy v3 this is matcher.add('TIME', [pattern])
matcher.add('TIME', None, pattern)

matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])
#=> [5am, 6am, 9pm, 6am, 6am, 9pm]
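A side note on the regex itself: the answer writes `[ap]` where the question wrote `[a|p]`. Inside a character class `|` is a literal pipe and not alternation, and `\d{1}` is equivalent to `\d`. The sketch below shows both effects with the standard `re` module; the sample string is made up for illustration.

```python
import re

# Made-up sample text containing a two-digit hour and a stray pipe.
sample = "open 9pm, closes 12am, code 7|m"

# "|" inside a character class is a literal pipe, so "7|m" matches too.
with_pipe = re.findall(r"\d[a|p]m", sample)
print(with_pipe)  # ['9pm', '2am', '7|m']

# Dropping the pipe leaves only the intended a/p alternatives.
without_pipe = re.findall(r"\d[ap]m", sample)
print(without_pipe)  # ['9pm', '2am']

# \d matches a single digit, so "12am" yields only "2am" above; \d{1,2}
# with word boundaries captures the full hour.
full_hours = re.findall(r"\b\d{1,2}[ap]m\b", sample)
print(full_hours)  # ['9pm', '12am']
```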

