Tolerate certain characters in RegEx
I am writing a message formatting parser that has the capability (among others) to parse links. This specific case requires parsing a link in the form of <url|linkname> and replacing that text with just the linkname.
The issue here is that both url and linkname may or may not contain \1 or \2 characters anywhere, in any order (at most one of each, though). I want to match the pattern but keep the "invalid" characters. This problem solves itself for linkname, as that part of the pattern is just ([^\n]+), but the url fragment is matched by a much more complicated pattern, more specifically the URL validation pattern from is.js. It would not be trivial to modify the whole pattern manually to tolerate [\1\2] everywhere, and I need the pattern to preserve those characters as they are used for tracking purposes (so I can't simply .replace(/\1|\2/g, "") before matching).
If this kind of matching is not possible, is there some automated way to reliably modify the RegExp to add [\1\2]{0,2} between every character match, add \1\2 to all [chars] matches, etc.?
This is the url pattern taken from is.js:
/(?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?/i
This pattern was adapted for my purposes and for the <url|linkname> format as follows:
let namedUrlRegex = /<((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)\|([^\n]+)>/ig;
The code where this is used is here: JSFiddle
Examples for clarification (... represents the namedUrlRegex variable from above, and $2 is the capture group that captures linkname):
Current behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "<googl\1e.com|Google>" WRONG
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle" CORRECT
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>" CORRECT
Expected behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "Google" (note there is no \1)
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle"
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>"
Note: the same rules for \1 apply to \2, \1\2, \1...\2, \2...\1, etc.
Context: This is used to normalize a string from a WYSIWYG editor to the length/content that it will display as, preserving the location of the current selection (denoted by \1 and \2 so it can be restored after parsing). If the "caret" is removed completely (e.g. if the cursor was in the URL of a link), it will select the whole string instead. Everything works as expected, except when the selection starts or ends in the url fragment.
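The marker round trip described in the context above might look roughly like this. It is a minimal sketch, not the code from the JSFiddle: markSelection and readSelection are hypothetical names, "\x01" and "\x02" are assumed to be the characters the post writes as \1 and \2, and treating a single missing marker the same as both missing is an assumption.

```javascript
// Assumed marker characters (the post writes them as \1 and \2).
const SEL_START = "\x01";
const SEL_END = "\x02";

// Hypothetical helper: embed the selection offsets into the text as markers.
function markSelection(text, start, end) {
  return text.slice(0, start) + SEL_START +
         text.slice(start, end) + SEL_END +
         text.slice(end);
}

// Hypothetical helper: strip the markers and recover the selection offsets.
// If a marker did not survive parsing, fall back to selecting everything.
function readSelection(marked) {
  const text = marked.replace(/[\x01\x02]/g, "");
  const startIdx = marked.indexOf(SEL_START);
  const endIdx = marked.indexOf(SEL_END);
  if (startIdx === -1 || endIdx === -1) {
    return { text, start: 0, end: text.length };
  }
  // Each offset shifts left by one if the other marker sits before it.
  const start = startIdx - (endIdx < startIdx ? 1 : 0);
  const end = endIdx - (startIdx < endIdx ? 1 : 0);
  return { text, start, end };
}

const marked = markSelection("Hello world", 6, 11);
const restored = readSelection(marked);
```

Any parsing step that runs between the two calls has to leave the markers in place wherever the surrounding text survives, which is exactly what the strict URL pattern fails to do.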
Edit for clarification: I only want to change a segment in a string if it follows the format <url|linkname>, where url matches the URL pattern (tolerating \1, \2) and linkname consists of non-\n characters. If this condition is not met within a <...|...> string, it should be left unaltered, as per the not_a_url example above.
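One way to express that condition without rewriting the URL pattern at all is to match the <...|...> shell loosely and validate a marker-free copy of the url inside a replace callback. This is an alternative sketch, not the approach the post ends up using. The names replaceNamedLinks and urlPattern are hypothetical, urlPattern is a heavily simplified stand-in for the is.js pattern, and the loose shell assumes the url contains no | character.

```javascript
// Hypothetical, simplified stand-in for the is.js URL pattern, anchored so
// that the whole candidate must be a URL.
const urlPattern = /^(?:[a-z0-9-]+\.)+[a-z]{2,}(?:\/\S*)?$/i;

// The two tracking characters (assumed to be U+0001 and U+0002).
const markers = /[\x01\x02]/g;

function replaceNamedLinks(text) {
  // Loose shell: "<", anything without | or newline, "|", the link name, ">".
  return text.replace(/<([^|\n]+)\|([^\n]+)>/g, (match, url, linkname) => {
    // Validate a copy of the url with the markers removed ...
    const bareUrl = url.replace(markers, "");
    // ... but return the untouched linkname, or the untouched match.
    return urlPattern.test(bareUrl) ? linkname : match;
  });
}

const a = replaceNamedLinks("<googl\x01e.com|Google>");   // "Google"
const b = replaceNamedLinks("<google.com|Goo\x01gle>");   // "Goo\x01gle"
const c = replaceNamedLinks("<not_\x01a_url|Google>");    // left unaltered
```

Because the markers are only stripped from the copy used for validation, a marker in linkname survives and a non-URL segment is returned exactly as it came in.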
I ended up making a RegEx that matches all "symbols" in the expression. One quirk of this is that it expects the :, =, and ! characters to be escaped, even outside of a (?:...), (?=...), or (?!...) expression. This is addressed by escaping them before processing.
let r = /(\\.|\[.+?\]|\w|[^\\\/\[\]\^\$\(\)\?\*\+\{\}\|\+\:\=\!]|(\{.+?\}))(?:((?:\{.+?\}|\+|\*)\??)|\??)/g;
let url = /((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/
function tolerate(regex, insert) {
    let first = true;
    // convert to string
    return regex.toString().replace(/\/(.+)\//, "$1").
        // escape :=!
        replace(/((?:^|[^\\])\\(?:\\)*\(\?|[^?])([:=!]+)/g, (m, g1, g2) => g1 + (g2.split("").join("\\"))).
        // substitute string
        replace(r, function(m, g1, g2, g3, g4) {
            // g2 = {...} multiplier (to prevent matching digits as symbols)
            if (g2) return m;
            // g3 = multiplier after symbol (must wrap in parentheses to preserve behavior)
            if (g3) return "(?:" + insert + g1 + ")" + g3;
            // prevent matching tolerated characters at beginning, remove to change this behavior
            if (first) {
                first = false;
                return m;
            }
            // insert the insert
            return insert + m;
        });
}
alert(tolerate(url, "\1?\2?"));
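To make the intended result of the rewrite easier to see, here is a hand-built tolerant version of a much smaller pattern. It is only a sketch: the toy pattern /^go+gle\.com$/ is a hypothetical stand-in for the URL regex, the tolerant source is assembled by hand following the rules in the comments above rather than by calling tolerate, and "\x01"/"\x02" are assumed to be the characters written as \1 and \2.

```javascript
// Tolerated prefix as regex source text: optional U+0001, then optional U+0002.
const insert = "\\x01?\\x02?";

// Hypothetical toy pattern standing in for the URL regex.
const strict = /^go+gle\.com$/;

// Hand-applied version of the rewrite rules:
//  - the first symbol (g) is left alone
//  - a quantified symbol (o+) becomes (?:<insert>o)+
//  - every other symbol gets the insert placed in front of it
const tolerantSource =
  "^g" +
  "(?:" + insert + "o)+" +
  ["g", "l", "e", "\\.", "c", "o", "m"].map(sym => insert + sym).join("") +
  "$";
const tolerant = new RegExp(tolerantSource);

const strictResult = strict.test("goo\x01gle.com");        // false
const tolerantResult = tolerant.test("goo\x01gle.com");    // true
const bothMarkers = tolerant.test("g\x01oo\x02gle.com");   // true
const wrongOrder = tolerant.test("g\x02\x01oogle.com");    // false
```

The last case shows a limitation of an insert such as \1?\2?: it accepts the two markers only in that order when they are adjacent. Passing [\1\2]{0,2} as the insert, as proposed in the question, would accept either order.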