Tolerate certain characters in RegEx

I am writing a message formatting parser that has the capability (among others) to parse links. This specific case requires parsing a link in the form of <url|linkname> and replacing that text with just the linkname. The issue here is that both url and linkname may or may not contain \1 or \2 characters anywhere, in any order (at most one of each, though). I want to match the pattern but keep the "invalid" characters. This problem solves itself for linkname, as that part of the pattern is just ([^\n]+), but the url fragment is matched by a much more complicated pattern, more specifically the URL validation pattern from is.js. It would not be trivial to modify the whole pattern manually to tolerate [\1\2] everywhere, and I need the pattern to preserve those characters as they are used for tracking purposes (so I can't simply .replace(/\1|\2/g, "") before matching).

If this kind of matching is not possible, is there some automated way to reliably modify the RegExp to add [\1\2]{0,2} between every character match, add \1\2 to all [chars] matches, etc.?

This is the url pattern taken from is.js:

/(?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?/i

This pattern was adapted for my purposes and for the <url|linkname> format as follows:

let namedUrlRegex = /<((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)\|([^\n]+)>/ig;

The code where this is used is here: JSFiddle

Examples for clarification (... represents the namedUrlRegex variable from above, and $2 is the capture group that captures linkname):

Current behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "<googl\1e.com|Google>" WRONG
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle"              CORRECT
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>"   CORRECT

Expected behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "Google" (note there is no \1)
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle"
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>"

Note: the same rules for \1 apply to \2, \1\2, \1...\2, \2...\1, etc.

Context: This is used to normalize a string from a WYSIWYG editor to the length/content that it will display as, preserving the location of the current selection (denoted by \1 and \2 so it can be restored after parsing). If the "caret" is removed completely (e.g. if the cursor was in the URL of a link), it will select the whole string instead. Everything works as expected, except when the selection starts or ends in the url fragment.
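To make the role of the markers concrete, here is a minimal sketch of how the selection might be recovered from a parsed string. The helper name extractSelection and its return shape are hypothetical, and falling back to the whole string when either marker is missing is an assumption based on the behavior described above.

```javascript
// Hypothetical helper (name and return shape are assumptions for illustration).
// \x01 marks the selection start, \x02 marks the selection end.
function extractSelection(marked) {
    const START = "\x01";
    const END = "\x02";
    const startIdx = marked.indexOf(START);
    const endIdx = marked.indexOf(END);
    // Remove both markers to get the text as it will be displayed.
    const text = marked.split(START).join("").split(END).join("");
    // Assumption: if either marker was lost during parsing, select everything.
    if (startIdx === -1 || endIdx === -1) {
        return { text: text, start: 0, end: text.length };
    }
    // A marker that precedes the other one shifts that position left by one.
    const start = startIdx - (endIdx < startIdx ? 1 : 0);
    const end = endIdx - (startIdx < endIdx ? 1 : 0);
    return { text: text, start: start, end: end };
}
```

For example, extractSelection("Goo\x01gl\x02e") yields the text "Google" with start 3 and end 5.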

Edit for clarification: I only want to change a segment in a string if it follows the format of <url|linkname>, where url matches the URL pattern (tolerating \1, \2) and linkname consists of non-\n characters. If this condition is not met within a <...|...> string, it should be left unaltered, as per the not_a_url example above.

I ended up making a RegEx that matches all "symbols" in the expression. One quirk of this is that it expects :, =, ! characters to be escaped, even outside of a (?:...), (?=...), (?!...) expression. This is addressed by escaping them before processing.

Fiddle

let r = /(\\.|\[.+?\]|\w|[^\\\/\[\]\^\$\(\)\?\*\+\{\}\|\+\:\=\!]|(\{.+?\}))(?:((?:\{.+?\}|\+|\*)\??)|\??)/g;

let url = /((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/

function tolerate(regex, insert) {
    let first = true;
    // convert to string
    return regex.toString().replace(/\/(.+)\//, "$1").
        // escape :=!
        replace(/((?:^|[^\\])\\(?:\\)*\(\?|[^?])([:=!]+)/g, (m, g1, g2) => g1 + (g2.split("").join("\\"))).
        // substitute string
        replace(r, function(m, g1, g2, g3, g4) {
            // g2 = {...} multiplier (to prevent matching digits as symbols)
            if (g2) return m;
            // g3 = multiplier after symbol (must wrap in parenthesis to preserve behavior)
            if (g3) return "(?:" + insert + g1 + ")" + g3;
            // prevent matching tolerated characters at beginning, remove to change this behavior
            if (first) {
                first = false;
                return m;
            }
            // insert the insert
            return insert + m;
        }
    );
}

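// "\x01" and "\x02" are the same characters as the legacy octal escapes "\1" and "\2",
// but they are also valid in strict mode
alert(tolerate(url, "\x01?\x02?"));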
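
A different approach, not the one used above, is to avoid rewriting the URL pattern at all: match <...|...> loosely, strip the markers from a copy of the url part only, and validate that copy in a replace callback. The sketch below is a hedged illustration; the names simpleUrl, candidate and replaceNamedLinks are hypothetical, and simpleUrl is a deliberately simplified stand-in for the is.js pattern, which would need to be anchored with ^ and $ in the same way.

```javascript
// Simplified stand-in for the is.js URL pattern (an assumption for brevity).
// Anchored so that the whole candidate must be a URL.
const simpleUrl = /^(?:(?:https?|ftp):\/\/)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?::\d{2,5})?(?:\/\S*)?$/i;

// Loose outer pattern: anything shaped like <left|right> on a single line.
// Assumption: neither part contains "<" or ">", which is stricter than [^\n]+.
const candidate = /<([^|<>\n]+)\|([^<>\n]+)>/g;

function replaceNamedLinks(text) {
    return text.replace(candidate, (match, url, linkname) => {
        // Strip the tracking markers from a copy of the url only.
        const bare = url.replace(/[\x01\x02]/g, "");
        // Valid URL: keep just the linkname (with any markers it contains).
        // Otherwise: leave the original segment untouched.
        return simpleUrl.test(bare) ? linkname : match;
    });
}
```

With this sketch, markers inside linkname survive because linkname is returned as-is, while markers inside url disappear together with the url, which matches the expected behavior listed in the question.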
