Match a list of words preceded and followed by some special character
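I'm trying to write a regex that matches a very long list of words (about 4000) whenever a word is at the start of the string, at the end of the string, or preceded and followed by a special character. The regex I'm currently using looks like this: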
((?:[^a-zA-Z0-9]|^)FIND(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)ANY(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)MATCHING(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)WORD(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)BY(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)THIS(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)VERY(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)LONG(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)REGEX(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)PATTERN(?:[^a-zA-Z0-9]|$))
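The regex continues like this for about 4000 words. I use the Python re module / ripgrep to check whether some strings match, and I'd like to know which words matched for each string.
I use non-capturing groups because I don't care what comes before or after the word, only the matched word itself.
However, for some generic strings I tested, each iteration takes about 3-4 seconds on a Raspberry Pi, and I'd like to know whether I can somehow generate a faster pattern for this use case.
Thanks.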
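First of all, the (?:[^a-zA-Z0-9]|^) and (?:[^a-zA-Z0-9]|$) patterns are used here as word boundaries that exclude _ (an underscore next to a word does not prevent a match). It makes sense to simplify them and use the lookarounds (?<![^\W_]) and (?![^\W_]) respectively.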
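To make the difference between the two boundary styles concrete, here is a minimal sketch. The keyword WORD and the sample strings are illustrative assumptions, not taken from the question. It also passes re.ASCII, because in Python 3 \W is Unicode-aware for str patterns, so without that flag [^\W_] would additionally treat non-ASCII letters and digits as word characters.

```python
import re

# Original style: the neighbouring character is consumed by the match.
consuming = re.compile(r'(?:[^a-zA-Z0-9]|^)(WORD)(?:[^a-zA-Z0-9]|$)')

# Lookaround style: neighbours are only inspected, never consumed.
# re.ASCII keeps \W ASCII-only, so [^\W_] means exactly [a-zA-Z0-9].
lookaround = re.compile(r'(?<![^\W_])(WORD)(?![^\W_])', re.ASCII)

# Hypothetical sample strings chosen to exercise the boundaries.
samples = ['WORD', 'a WORD here', '_WORD_', 'SWORD', 'WORDS', 'WORD,WORD']

results = {s: (consuming.findall(s), lookaround.findall(s)) for s in samples}
for s, (old, new) in results.items():
    print(f'{s!r:15} consuming={old} lookaround={new}')
```

The two versions agree on every sample except 'WORD,WORD': the consuming pattern misses the second occurrence because the comma was already eaten by the first match, while the lookaround pattern finds both.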
Next, the words you have can be processed to build a regex that searches efficiently.
Here is some sample code:
from trieregex import TrieRegEx
keywords = ['FIND', 'ANY', 'MATCHING', 'WORD', 'BY', 'THIS', 'VERY', 'LONG', 'REGEX', 'PATTERN', 'PARROT', 'FIGHT']
pattern = fr'(?<![^\W_])({TrieRegEx(*keywords).regex()})(?![^\W_])'
# => (?<![^\W_])((?:PA(?:TTERN|RROT)|FI(?:GHT|ND)|MATCHING|REGEX|LONG|THIS|VERY|WORD|ANY|BY))(?![^\W_])
Just make sure you install trieregex beforehand.
See the resulting regex pattern in the comment above.
See also another demo based on this regex trie solution:
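As a quick sanity check that the prefix-sharing pattern is a drop-in replacement for a plain alternation, the sketch below compares the two on one string. It copies the generated pattern from the comment above as a literal, so trieregex itself is not needed; the sample text is an assumption made up for illustration.

```python
import re

keywords = ['FIND', 'ANY', 'MATCHING', 'WORD', 'BY', 'THIS', 'VERY',
            'LONG', 'REGEX', 'PATTERN', 'PARROT', 'FIGHT']

# Flat alternation: every keyword is a separate branch.
flat = re.compile(
    r'(?<![^\W_])(' + '|'.join(map(re.escape, keywords)) + r')(?![^\W_])'
)

# Trie-style alternation: common prefixes (PA..., FI...) are shared,
# so the engine has fewer branches to try at each position.
trie_style = re.compile(
    r'(?<![^\W_])((?:PA(?:TTERN|RROT)|FI(?:GHT|ND)|MATCHING|REGEX|LONG'
    r'|THIS|VERY|WORD|ANY|BY))(?![^\W_])'
)

# Hypothetical input string.
sample = 'FIND ANY PARROT, FIGHTER or PATTERN_X BY'

flat_hits = flat.findall(sample)
trie_hits = trie_style.findall(sample)
print(flat_hits)
print(trie_hits)
```

Both patterns should report the same words. With only a dozen keywords the speed difference is negligible; the benefit of sharing prefixes is expected to grow with a list of thousands of words.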
import re

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        # Walk (or create) one nested dict per character of the word.
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        # The empty-string key marks the end of a complete word.
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        # A node holding only the end marker contributes nothing further.
        if "" in data and len(data.keys()) == 1:
            return None
        alt = []  # alternatives that continue with a sub-pattern
        cc = []   # single characters that end a word (character class)
        q = 0     # set when a word may end at this node
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:
                    # recurse is None: the child is a leaf, keep only the character.
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0
        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')
        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"
        if q:
            # A word ends here, so whatever follows is optional.
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

text = r'FIND ANY MATCHING WORD BY THIS VERY LONG REGEX PATTERN FIGHT FIGHTER PARROT PARROT_ING'
keywords = ['FIND', 'ANY', 'MATCHING', 'WORD', 'BY', 'THIS', 'VERY', 'LONG', 'REGEX', 'PATTERN', 'PARROT', 'FIGHT']
trie = Trie()
for word in keywords:
    trie.add(word)
pattern = fr'(?<![^\W_])({trie.pattern()})(?![^\W_])'
print(re.findall(pattern, text))
Output:
['FIND', 'ANY', 'MATCHING', 'WORD', 'BY', 'THIS', 'VERY', 'LONG', 'REGEX', 'PATTERN', 'FIGHT', 'PARROT', 'PARROT']
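Note that PARROT appears twice: the last one comes from the PARROT_ING part of the string, because the underscore is not treated as a word character by these boundaries.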
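If the underscore should instead count as part of a word, the standard \b boundary is the usual alternative. The short sketch below contrasts the two on a shortened version of the demo text; the two-word alternation is an assumption chosen only to keep the example small.

```python
import re

sample = 'FIGHT FIGHTER PARROT PARROT_ING'
words = r'(?:FIGHT|PARROT)'

# Underscore-excluding boundaries, as in the answer: '_' may touch a word.
excluding_underscore = re.findall(rf'(?<![^\W_])({words})(?![^\W_])', sample)

# Standard \b boundaries: '_' is a word character, so PARROT_ING is skipped.
standard_boundary = re.findall(rf'\b({words})\b', sample)

print(excluding_underscore)
print(standard_boundary)
```

The first list contains PARROT twice, the second only once, which is the behaviour to weigh when deciding which boundary matches the original requirement.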