
Finding a character outside of markup in JavaScript Regex?

I'm trying to find & characters via regex that fit a particular rule, to avoid breaking formatting during markdown parsing. The characters should only be matched where they are outside of <> tags (e.g. *<a href="...">*</a>* ), and outside of parentheses that are immediately preceded by square brackets, such as *[*]()* .

The current version of the regex, which works for the first case, is:

/(\&)(?![^<]*>|[<>]*<\/)/gi

And can be viewed here. In this case the third match on the third line should not match.
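To make the problem concrete without the external link, here is a small sketch that runs the regex above against the third line of the test case. The variable names (`tagAwareAmp`, `thirdLine`, `indices`) are illustrative only and not part of the original question.

```javascript
// The regex from the question, unchanged.
const tagAwareAmp = /(\&)(?![^<]*>|[<>]*<\/)/gi;

// Third line of the test case.
const thirdLine = '& ![test & amp](http://www.google.com?a=b&c=d) &';

// Collect the start index of every match.
const indices = Array.from(thirdLine.matchAll(tagAwareAmp), m => m.index);
console.log(indices);

// The third reported match is the & inside the (...) link target,
// which is the one that should be excluded.
console.log(thirdLine.slice(indices[2] - 3, indices[2] + 4));
```

The line contains no tags, so the lookahead never rejects anything and every & is reported, including the one inside the URL.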

In addition, the test case from the link above is reproduced below, for the sake of not relying entirely on external sites:

& <a href="http://www.google.com?a=b&c=d"> & </a> &
& <a href="http://www.google.com?a=b&c=d"> & </a> &
& ![test & amp](http://www.google.com?a=b&c=d) &
& all the amps on this line should match [ & ] (&) [ &] ( & ) [& ] (& )[&] ( &) &
& <a href="http://www.google.com?a=b&c=d"> & </a> &
& <a href="http://www.google.com?a=b&c=d"> & </a> && <a href="http://www.google.com?a=b&c=d"> & </a> && <a href="http://www.google.com?a=b&c=d"> & </a> &
& <a href="http://www.google.com?a=b&c=d"> & </a> &
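function processTextNodes(htmlString, callback) {
    // Let the browser's HTML parser build a DOM from the string.
    var div = document.createElement('div');
    div.innerHTML = htmlString;

    // Breadth-first walk over all elements, starting at the container.
    var elements = [div];
    var element, child, i;

    while (elements.length) {
        element = elements.shift();
        for (i = 0; i < element.childNodes.length; i++) {
            child = element.childNodes[i];
            if (child.nodeType === element.ELEMENT_NODE) {
                // Queue nested elements so their text nodes are visited too.
                elements.push(child);
            } else if (child.nodeType === element.TEXT_NODE) {
                // Replace the text with whatever the callback returns.
                child.textContent = callback(child);
            }
        }
    }

    // Serializing back to a string escapes special characters in text.
    return div.innerHTML;
}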

Usage:

var html = 'hello <h1>This is a heading & a <span>nested value</span></h1> bye!';

processTextNodes(html, function (textNode) {
    return textNode.textContent.toUpperCase();
});

gives you

"HELLO <h1>THIS IS A HEADING &amp; A <span>NESTED VALUE</span></h1> BYE!"

Note how the escaping is done by the browser's HTML parser. Don't try to re-implement that, especially not with regex. The world's most powerful HTML parser, one that can even deal with any kind of broken input, is right at your fingertips. Use it.

If you don't need the "process text node values" part, remove it and the function becomes very short:

function fixHTML(htmlString) {
    var div = document.createElement('div');
    div.innerHTML = htmlString;    
    return div.innerHTML;
}
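
Once the parser has separated text nodes from tags, the remaining rule from the question (skip & inside parentheses that directly follow square brackets) can be checked on plain text without lookbehind. The following is only a sketch of one way to do that: `ampersandsOutsideLinkTargets` is a hypothetical helper, not part of the answer above, and it assumes link targets contain no nested `)`.

```javascript
// Hypothetical helper: returns the indices of "&" characters that are NOT
// inside a Markdown link target, i.e. not inside the (...) that directly
// follows a closing "]". Assumes the target itself contains no ")".
function ampersandsOutsideLinkTargets(text) {
    var indices = [];
    // First alternative consumes a whole "](...)" target so that any "&"
    // inside it is never seen by the second alternative.
    var re = /\]\([^)]*\)|&/g;
    var match;
    while ((match = re.exec(text)) !== null) {
        if (match[0] === '&') {
            indices.push(match.index);
        }
    }
    return indices;
}

var linkLine = '& ![test & amp](http://www.google.com?a=b&c=d) &';
console.log(ampersandsOutsideLinkTargets(linkLine));
```

Inside the `processTextNodes` callback this could be called on `textNode.textContent`, so that tags and attributes never reach the regex at all.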

For anyone who happens to come across this question: contrary to what someone on this page is suggesting, it is not impossible. I was able to get it working by using lookbehinds after enabling the experimental JavaScript features in the V8 engine. The following works in Chrome after going to chrome://flags and enabling Experimental JavaScript, or in node.js when run with the --harmony option.

/(?<!(?<=\[(.*))\]\(([a-zA-Z0-9\-\.\_\~\:\/\?\#\[\]\@\!\$\&\'\(\)\*\+\,\;\=\%]*))(\&)(?![^<]*>|[<>]*<\/)/gi

Example fiddle (must have Harmony enabled within Chrome to view correctly).

Hopefully lookbehinds will make it into the next ECMAScript standard so the other experimental JS stuff won't be needed with it.
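
Lookbehind assertions were later standardized in ECMAScript 2018, so in an engine that supports them the regex above should run without any flag. A quick check against the third line of the question's test case might look like this; the variable names are illustrative only.

```javascript
// The lookbehind regex from this answer, unchanged.
const lookbehindAmp = /(?<!(?<=\[(.*))\]\(([a-zA-Z0-9\-\.\_\~\:\/\?\#\[\]\@\!\$\&\'\(\)\*\+\,\;\=\%]*))(\&)(?![^<]*>|[<>]*<\/)/gi;

// Third line of the question's test case.
const sample = '& ![test & amp](http://www.google.com?a=b&c=d) &';

// Collect the start index of every match.
const found = Array.from(sample.matchAll(lookbehindAmp), m => m.index);

// The & inside the (...) link target should no longer be reported.
console.log(found);
```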
