JavaScript RegExp lookbehind alternative?

Question

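I wrote a regular expression that matches title="..." in <a>; unfortunately, it also matches title="..." in <img/>.

Is there a way to tell the regex to look for title="..." only in <a>? I can't use a lookbehind such as (?<=<a\s+) because JavaScript doesn't support them.

Here is my expression: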

/((title=".+")(?=\s*href))|(title=".+")/igm;

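The expression above matches the following:

[Image: screenshot of the matches]

As you can see, it also matches the title="..." in <img/>; I need the expression to exclude titles found in image tags.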

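Here is a link to the RegExp.

Also, if possible, I need to get rid of the title="" around the title, so that only title AFTER href and title BEFORE href are returned. If that is not possible, I guess I can use .replace() and replace it with "".

zx81's expression:

[Image: screenshot of the matches]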

Answer 1

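First off, you should know that most people prefer to parse HTML with a DOM parser, because regex can present certain hazards. That being said, for this straightforward task (no nesting), here is what you can do in regex.

Use capture groups

We don't have lookbehinds or \K in JavaScript, but we can capture what we want into a capture group, then retrieve the match from that group and ignore the rest.

This regex captures the title in Group 1: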

<a [^>]*?(title="[^"]*")

In the demo, look at the Group 1 captures in the right pane: that is what we are interested in.

Sample JavaScript code

var unique_results = [];
var yourString = 'your_test_string'; // replace with the HTML to search
var myregex = /<a [^>]*?(title="[^"]*")/g;
var thematch = myregex.exec(yourString);
while (thematch !== null) {
    // is it unique?
    if (unique_results.indexOf(thematch[1]) < 0) {
        // add it to array of unique results
        unique_results.push(thematch[1]);
        document.write(thematch[1], "<br />");
    }
    // match the next one
    thematch = myregex.exec(yourString);
}

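Explanation

<a matches the beginning of the tag
[^>]*? lazily matches any characters that are not a >, up to...
( capture group
title=" literal characters
[^"]* any characters that are not a quote
" closing quote
) end of Group 1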
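
The question also asks for the bare title without the title="" wrapper. A minimal sketch of one way to get that, assuming it is acceptable to move the parentheses so that Group 1 wraps only the attribute value; the sample string and the function name extractLinkTitles are made up for illustration:

```javascript
// Hypothetical sample input; the <img> title must not be returned.
var html = '<a href="a.html" title="first">A</a>' +
           '<img src="x.png" title="image title"/>' +
           '<a title="second" href="b.html">B</a>';

// Same pattern as above, but the group wraps only the attribute value.
var titleValue = /<a [^>]*?title="([^"]*)"/g;

function extractLinkTitles(input) {
    var results = [];
    var m;
    // A /g regex remembers where it stopped, so reset it before each scan.
    titleValue.lastIndex = 0;
    while ((m = titleValue.exec(input)) !== null) {
        results.push(m[1]); // Group 1: the text between the quotes
    }
    return results;
}

var linkTitles = extractLinkTitles(html);
console.log(linkTitles); // [ 'first', 'second' ]
```

Because [^>]* cannot cross a >, the search for title= never leaves the opening <a ...> tag, which is what keeps the <img/> title out of the results.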

Answer 2

I'm not sure whether you can do this with a single regular expression in JavaScript; however, you can do something like this:

http://jsfiddle.net/KYfKT/1/

var str = '\
<a href="www.google.com" title="some title">\
<a href="www.google.com" title="some other title">\
<a href="www.google.com">\
<img href="www.google.com" title="some title">\
';

var matches = [];
//-- somewhat hacky use of .replace() in order to utilize the callback on each <a> tag
str.replace(/\<a[^\>]+\>/g, function (match) {
    //-- if the <a> tag includes a title, push it onto matches
    var title = match.match(/((title=".+")(?=\s*href))|(title=".+")/igm);
    title && matches.push(title[0].substr(7, title[0].length - 8));
});

document.body.innerText = JSON.stringify(matches);
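
The same two-step idea (isolate each <a> tag, then look for a title inside it) can also be sketched without the side-effect call to .replace(). This version assumes an engine that supports String.prototype.matchAll (ES2020); the names markup, anchorTitles, and anchorTitleList are hypothetical. It also swaps the greedy .+ for [^"]*, since the greedy form can run past the closing quote when another quoted attribute follows the title.

```javascript
// Hypothetical input modelled on the sample string above.
const markup = [
    '<a href="www.google.com" title="some title">',
    '<a title="some other title" href="www.google.com">',
    '<a href="www.google.com">',
    '<img href="www.google.com" title="some title">'
].join('\n');

function anchorTitles(source) {
    const titles = [];
    // Step 1: isolate each opening <a ...> tag.
    for (const tag of source.matchAll(/<a\s[^>]*>/gi)) {
        // Step 2: look for a double-quoted title inside that tag only.
        const hit = /\stitle="([^"]*)"/i.exec(tag[0]);
        if (hit) {
            titles.push(hit[1]);
        }
    }
    return titles;
}

const anchorTitleList = anchorTitles(markup);
console.log(anchorTitleList); // [ 'some title', 'some other title' ]
```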

You should really use the DOM for this, though, rather than regular expressions:

http://jsfiddle.net/KYfKT/3/

var str = '\
<a href="www.google.com" title="some title">Some Text</a>\
<a href="www.google.com" title="some other title">Some Text</a>\
<a href="www.google.com">Some Text</a>\
<img href="www.google.com" title="some title"/>\
';

var div = document.createElement('div');
div.innerHTML = str;
var titles = Array.prototype.slice.call(div.querySelectorAll('a[title]')).map(function (item) { return item.title; });

document.body.innerText = titles;

Answer 3

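I'm not sure where your HTML source comes from, but I do know that some browsers do not preserve the case (or the attribute order) of the source when it is extracted as innerHTML.

Also, both authors and browsers may use single or double quotes.
These are the two most common cross-browser pitfalls that I know of.

So you could try: /<a [^>]*?title=(['"])([^\1]*?)\1/gi

It performs a non-greedy, case-insensitive search and uses a backreference to handle single versus double quotes.

The first part is already explained in zx81's answer. \1 matches the first capture group, so it matches whichever opening quote was used. The second capture group should then contain the bare title string.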

A simple example:

var rxp=/<a [^>]*?title=(['"])([^\1]*?)\1/gi
,   res=[]
,   tmp
;

while( tmp=rxp.exec(str) ){  // str is your string
  res.push( tmp[2] );        //example of adding the strings to an array.
}
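
One detail worth checking in the pattern above: inside a character class, JavaScript does not treat \1 as a backreference. As far as I can tell, in a pattern without the u flag it is read as a legacy octal escape, so [^\1] accepts almost any character, quotes included, and it is the lazy *? followed by the real \1 outside the class that stops at the matching quote. A minimal sketch of a variant that spells out the intent with a negative lookahead; the sample string and variable names are made up for illustration:

```javascript
// Hypothetical sample mixing single quotes, double quotes, and upper case.
const sample = "<a href='x.html' title='it is \"quoted\"'>X</a>" +
               '<A TITLE="second one" HREF="y.html">Y</A>' +
               '<img title="not a link">';

// [^\1] does not exclude the captured quote: it happily matches a quote.
const classAcceptsQuote = /[^\1]/.test('"');
console.log(classAcceptsQuote); // true

// (?:(?!\1).)* consumes characters one at a time, stopping right before the
// next occurrence of whichever quote was captured in Group 1.
const quotedTitle = /<a [^>]*?title=(['"])((?:(?!\1).)*)\1/gi;

const quoteAgnosticTitles = Array.from(sample.matchAll(quotedTitle), m => m[2]);
console.log(quoteAgnosticTitles); // [ 'it is "quoted"', 'second one' ]
```

Note that . does not match line breaks, so this sketch assumes a title never spans several lines.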

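However, as others have pointed out, running regex over tag soup (a.k.a. HTML) is (usually) a bad idea. Robert Messerle's alternative (using the DOM) is preferable!

Warning (I almost forgot)
IE6 (and others?) has this wonderful "memory-saving feature" that conveniently removes all unneeded quotes (for strings that contain no spaces). So there, this regex (and zx81's) will fail, because they rely on the quotes being present!!!! Back to the drawing board.. (a seemingly never-ending process when regexing HTML).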

3 answers

Answer 1
2 2014-06-30 02:48:22

Answer 2
1 Accepted 2014-06-30 02:44:25

Answer 3
1 2014-06-30 03:54:40