Detecting multiple HTML tags using JavaScript and regex

Question

I am building a Chrome extension that reads the current page and detects specific HTML/XML tags in it.

For example, suppose the current page contains the following tags or data:

some random text here and there

<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>state bank of america</accountName>
<accountHolder>rahul raina</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="-2044388005">
<description>Active Global Equities</description>
<value curCode="USD">159436.01</value>
</holding>
<holding holdingType="mutualFund" uniqueId="-556870249">
<description>Passive Non-US Equities</description> 
<value curCode="USD">72469.76</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
some data 123

<site name="McKinsey401k">
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>rahuk</accountName>
<accountHolder>rahuk</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="1285447255">
<description>Special Sits. Aggr. Long-Term</description>
<value curCode="USD">101944.69</value>
</holding>
<holding holdingType="mutualFund" uniqueId="1721876694">
<description>Special Situations Moderate $</description>
<value curCode="USD">49444.98</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>

So I need to identify a tag, say <accountName>, and print the text between its opening and closing tags, i.e. "state bank of america" and "rahuk".

This is what I have done so far:

function countString(document_r, a, b) {
    var test = document_r.body;
    var text = typeof test.textContent == 'string' ? test.textContent : test.innerText;
    var testRE = text.match(a + "(.*)" + b);
    return testRE[1];
}

chrome.extension.sendMessage({
    action: "getSource",
    source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document,'<accountName>','</accountName>')
});

But this prints only the inner text of the first matching tag on the page, i.e. "state bank of america".

What if I want to print only "rahuk", which is the inner text of the last matching tag on the page, or both?

How do I print the inner text of the last matching tag on the page, or of all matching tags?

Thanks in advance.

EDIT: The document above is itself an HTML page; I have just pasted the contents of the page.

UPDATE: I tried a few things based on the suggestions below, and the best I could reach is this code:

function countString(document_r) {
    var test = document_r.body;
    var text = test.innerText;

    var tag = "accountName";
    var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
    var regexg = new RegExp(regex, "g");
    var testRE = text.match(regexg);
    return testRE;
}

chrome.extension.sendMessage({
    action: "getSource",
    source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document)
});

But this gave me:

XML DETAILS>>>>> Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP)),Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP)),Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP))

This happens because the same XML is present in the page three times. What I want is for the regex to match only the last copy of the XML, and I don't want the tag names in the output either.

So my desired output would be:

XML DETAILS>>>>> Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP))

Answer 1

Use a regex pattern like this: <accountName>(.*?)<\\/accountName>

var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var testRE = text.match(regex);

=> testRE should contain your matches; for tag = accountName, that is "state bank of america" and "rahuk" (see the update below about the "g" flag).

UPDATE

According to this page, to receive all matches instead of only the first one, you must add a "g" flag to the match pattern.

"g: The global search flag makes the RegExp search for a pattern throughout the string, creating an array of all occurrences it can find matching the given pattern."

Hope this helps you!
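One thing to watch with the "g" flag: when the regex is global, match returns the full matched substrings (tags included) and drops the capture groups. To get only the text between the tags, one option is to iterate with matchAll and read group 1. This is a minimal sketch, not a definitive implementation: the helper name innerTexts and the sample string are made up for illustration, and it assumes the tag has no attributes and its name contains no regex metacharacters.

```javascript
// Hypothetical helper: collect the text between every <tag>...</tag> pair.
function innerTexts(text, tag) {
  // Lazy (.*?) stops at the first closing tag; matchAll requires the "g" flag.
  const regex = new RegExp("<" + tag + ">(.*?)</" + tag + ">", "g");
  // Each match is an array: index 0 is the full match, index 1 is the capture group.
  return Array.from(text.matchAll(regex), (m) => m[1]);
}

// Assumed sample text, shaped like the page content in the question.
const sample =
  "<accountName>state bank of america</accountName>\n" +
  "some data 123\n" +
  "<accountName>rahuk</accountName>";

const names = innerTexts(sample, "accountName");
console.log(names);                    // all inner texts, in page order
console.log(names[names.length - 1]);  // only the last one
```

Taking the last array element covers the "only the last XML" case from the question, since identical copies earlier in the page simply appear earlier in the array.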

Answer 2

Your match call is not global.

var regex = new RegExp(a+"(.*)"+b, "g");
text.match(regex);
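A caveat on the pattern itself: a greedy (.*) runs to the last closing tag it can reach, so two elements on the same line collapse into one match, while a lazy (.*?) keeps them separate. Because . does not match line breaks by default, the difference only shows when both elements sit on one line. The sample line below is an assumption made up for illustration.

```javascript
// Assumed sample: two elements on a single line.
const line =
  "<accountName>first</accountName> and <accountName>second</accountName>";

// Greedy: (.*) extends to the LAST closing tag on the line, giving one big match.
const greedy = line.match(/<accountName>(.*)<\/accountName>/g);

// Lazy: (.*?) stops at the FIRST closing tag, so each element matches separately.
const lazy = line.match(/<accountName>(.*?)<\/accountName>/g);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
```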

Answer 3

If the full XML string is valid, you can parse it into an XML document using the DOMParser.parseFromString method:

var xmlString = '<root>[Valid XML string]</root>';
var parser = new DOMParser();
var doc = parser.parseFromString(xmlString, 'text/xml');

Then you can get a list of tags with a specified name directly:

var found = doc.getElementsByTagName('tagName');

Here's a jsFiddle example using the XML you provided, with two minor tweaks: I had to add a root element and an opening tag for the first site.
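The root-element tweak can be sketched roughly as follows. The helper names wrapInRoot and tagTexts and the sample fragment are hypothetical, introduced only for illustration. The sketch assumes the fragment is otherwise well-formed, so it does not repair a missing opening tag like the one in the question's sample. DOMParser is a browser API, so the helper returns null where it is unavailable (for example in plain Node.js).

```javascript
// Hypothetical helper: wrap a fragment with several top-level elements in one root.
function wrapInRoot(fragment, rootName = "root") {
  return "<" + rootName + ">" + fragment + "</" + rootName + ">";
}

// Hypothetical helper: text of every <tagName> element, or null without DOMParser.
function tagTexts(xmlString, tagName) {
  // DOMParser exists in browsers; plain Node.js does not provide it.
  if (typeof DOMParser === "undefined") return null;
  const doc = new DOMParser().parseFromString(xmlString, "text/xml");
  // Browsers report XML errors through a <parsererror> element instead of throwing.
  if (doc.getElementsByTagName("parsererror").length > 0) return [];
  return Array.from(doc.getElementsByTagName(tagName), (el) => el.textContent);
}

// Assumed sample: two sibling <site> elements, which is not valid XML on its own.
const fragment =
  "<site><accountName>state bank of america</accountName></site>" +
  "<site><accountName>rahuk</accountName></site>";

const texts = tagTexts(wrapInRoot(fragment), "accountName");
console.log(texts); // in a browser: both names; null where DOMParser is unavailable
```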

Answer 4

You don't need regular expressions for this task (besides, read "RegEx match open tags except XHTML self-contained tags" for why it's not a good idea!). You can do this entirely with JavaScript:

var tag = "section";
var targets = document.getElementsByTagName(tag);
// Walk the collection from the last element to the first; valid indexes are 0..length-1.
for (var i = targets.length - 1; i >= 0; i--) {
    console.log(targets[i].innerText);
}
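If only the last element is wanted, there is no need to loop: the collection is array-like, so the last item sits at index length - 1. A minimal sketch, assuming the tags exist as real elements in the page; the helper name lastText is hypothetical, and a plain array stands in for the result of getElementsByTagName so the snippet also runs outside a page.

```javascript
// Hypothetical helper: text of the last item in an array-like collection, or null if empty.
function lastText(collection) {
  if (collection.length === 0) return null;
  const last = collection[collection.length - 1];
  // textContent is available on every node; innerText only on HTML elements.
  return last.textContent;
}

// Stand-in for document.getElementsByTagName("accountName"), assumed for illustration.
const fakeTargets = [
  { textContent: "state bank of america" },
  { textContent: "rahuk" },
];

console.log(lastText(fakeTargets)); // "rahuk"
```

In a page, the call would be lastText(document.getElementsByTagName(tag)).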

4 answers

Answer 1
1 2013-10-23 07:12:11

Answer 2
1 2013-10-23 07:12:18

Answer 3
1 2013-10-23 08:41:44

Answer 4
0 2013-10-23 08:03:56