
Detecting multiple HTML tags with JavaScript and regex

I am building a Chrome extension that reads the current page and detects specific HTML/XML tags in it.

For example, suppose the current page contains the following tags or data:

some random text here and there

<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>state bank of america</accountName>
<accountHolder>rahul raina</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="-2044388005">
<description>Active Global Equities</description>
<value curCode="USD">159436.01</value>
</holding>
<holding holdingType="mutualFund" uniqueId="-556870249">
<description>Passive Non-US Equities</description> 
<value curCode="USD">72469.76</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
some data 123

<site name="McKinsey401k">
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>rahuk</accountName>
<accountHolder>rahuk</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="1285447255">
<description>Special Sits. Aggr. Long-Term</description>
<value curCode="USD">101944.69</value>
</holding>
<holding holdingType="mutualFund" uniqueId="1721876694">
<description>Special Situations Moderate $</description>
<value curCode="USD">49444.98</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>

So I need to identify, say, the <accountName> tag and print the text between its opening and closing tags, i.e. "state bank of america" and "rahuk".

This is what I have done so far:

function countString(document_r, a, b) {
    var test = document_r.body;
    var text = typeof test.textContent == 'string' ? test.textContent : test.innerText;
    var testRE = text.match(a + "(.*)" + b);
    return testRE[1];
}

chrome.extension.sendMessage({
    action: "getSource",
    source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document,'<accountName>','</accountName>')
});

But this prints only the inner text of the first matching tag on the page, i.e. "state bank of america".

What if I want to print only "rahuk", the inner text of the last matching tag on the page, or both?

How do I print the inner text of the last tag it encounters on the page, or of all matching tags?

Thanks in advance.

EDIT: The document above is itself an HTML page; I have just pasted the contents of the page.

UPDATE: I tried a few things based on the suggestions below, and the best I could get was this code:

function countString(document_r) {
    var test = document_r.body;
    var text = test.innerText;

    var tag = "accountName";
    var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
    var regexg = new RegExp(regex, "g");
    var testRE = text.match(regexg);
    return testRE;
}

chrome.extension.sendMessage({
    action: "getSource",
    source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document)
});

But this gave me:

XML DETAILS>>>>> Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP)),Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP)),Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP))

This happens because the same XML is present in the page three times. What I want is for the regex to match only the last XML block, and I don't want the tag names in the output either.

So my desired output would be:

XML DETAILS>>>>> Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP))

Use a regex pattern like this: <accountName>(.*?)<\/accountName>

var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var testRE = text.match(regex);

=> testRE contains the match result. For tag = accountName, testRE[1] is the captured text "state bank of america". Without the "g" flag, only the first occurrence is matched.

UPDATE

According to this page, to receive all matches instead of only the first one, you must add a "g" flag to the match pattern.

"g: The global search flag makes the RegExp search for a pattern throughout the string, creating an array of all occurrences it can find matching the given pattern." (found here)

Hope this helps you!
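To get only the text between the tags (without the tag names) and then pick the last occurrence, one option is to loop with RegExp.prototype.exec and collect the first capture group. This is a minimal sketch, not a definitive implementation: extractTagTexts and the sample string are hypothetical, and it assumes the tag name contains no regex special characters and the tag carries no attributes.

```javascript
// Hypothetical helper: returns the inner text of every <tag>...</tag> pair in `text`.
function extractTagTexts(text, tag) {
  // [\s\S]*? is lazy and, unlike ".", also spans line breaks.
  var re = new RegExp("<" + tag + ">([\\s\\S]*?)</" + tag + ">", "g");
  var results = [];
  var m;
  // With the "g" flag, each exec() call resumes after the previous match.
  while ((m = re.exec(text)) !== null) {
    results.push(m[1]); // capture group 1: the text without the tags
  }
  return results;
}

// Sample text modeled on the page content in the question.
var sample = "<accountName>state bank of america</accountName>\n" +
  "some data 123\n" +
  "<accountName>rahuk</accountName>";

var names = extractTagTexts(sample, "accountName");
var last = names[names.length - 1]; // inner text of the last matching tag
```

Here names holds every inner text in document order, so the last element answers the "last tag only" case and the whole array answers the "all tags" case.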

Your match call is not global.

// "g" returns every occurrence; the lazy (.*?) stops at the first closing tag
var regex = new RegExp(a + "(.*?)" + b, "g");
text.match(regex);
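The quantifier matters as much as the flag. The sketch below compares a greedy and a lazy pattern on a single-line string; the variable names and the sample text are assumptions made for illustration. It also shows that match() with the "g" flag returns the full matches, tags included, rather than the capture groups.

```javascript
var a = "<accountName>";
var b = "</accountName>";
// Both tags sit on one line, so "." can run across the text between them.
var text = a + "state bank of america" + b + " some data 123 " + a + "rahuk" + b;

// Greedy: (.*) runs from the first opening tag to the LAST closing tag.
var greedy = text.match(new RegExp(a + "(.*)" + b, "g"));

// Lazy: (.*?) stops at the nearest closing tag, giving one match per element.
var lazy = text.match(new RegExp(a + "(.*?)" + b, "g"));

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
console.log(lazy[1]);       // <accountName>rahuk</accountName>
```

Because the global form of match() drops capture groups, the tag names stay in each result; reading the groups needs exec() in a loop or a similar approach.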

If the full XML string is valid, you can parse it into an XML document using the DOMParser.parseFromString method:

var xmlString = '<root>[Valid XML string]</root>';
var parser = new DOMParser();
var doc = parser.parseFromString(xmlString, 'text/xml');

Then you can get a list of the elements with a specified tag name directly:

var found = doc.getElementsByTagName('tagName');
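Putting the two steps together might look like the sketch below. getTagTexts is a hypothetical helper, and the parser is passed in as an argument (an assumption made so the function can also be exercised outside a browser); in a page or content script you would pass new DOMParser(). It assumes the fragment is well-formed apart from lacking a single root element.

```javascript
// Hypothetical helper. `parser` is anything with a DOMParser-style
// parseFromString method; in a browser, pass `new DOMParser()`.
function getTagTexts(fragment, tagName, parser) {
  // Wrap the fragment so the string has exactly one root element.
  var doc = parser.parseFromString("<root>" + fragment + "</root>", "text/xml");
  // Browsers report XML parse failures as a <parsererror> element
  // in the returned document instead of throwing.
  if (doc.getElementsByTagName("parsererror").length > 0) {
    throw new Error("Input is not well-formed XML");
  }
  var nodes = doc.getElementsByTagName(tagName);
  var texts = [];
  for (var i = 0; i < nodes.length; i++) {
    texts.push(nodes[i].textContent);
  }
  return texts;
}

// Browser usage (assumes `pageXml` holds well-formed XML taken from the page):
// var names = getTagTexts(pageXml, "accountName", new DOMParser());
// var lastName = names[names.length - 1];
```

The parsererror check is there because a stray closing tag, such as an unmatched </site>, would otherwise produce an empty result with no visible error.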

Here's a jsFiddle example using the XML you provided, with two minor tweaks: I had to add a root element and an opening tag for the first site.

You don't need regular expressions for this task (besides, read "RegEx match open tags except XHTML self-contained tags" for why it's not a good idea!). You can do this entirely with JavaScript:

var tag = "section";
var targets = document.getElementsByTagName(tag);
// Walk from the last element to the first; valid indexes are 0..length-1
for (var i = targets.length - 1; i >= 0; i--) {
    console.log(targets[i].innerText);
}
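If only the last element is wanted, the loop is unnecessary. A minimal sketch follows; lastTagText is a hypothetical helper, and it only works when the tags are real elements in the DOM rather than escaped text shown on the page.

```javascript
// Hypothetical helper: `root` is a Document or Element (anything that
// exposes getElementsByTagName).
function lastTagText(root, tag) {
  var targets = root.getElementsByTagName(tag);
  if (targets.length === 0) {
    return null; // no element with that tag name
  }
  // The collection is in document order, so the final index is the last tag.
  return targets[targets.length - 1].textContent;
}

// Browser usage: lastTagText(document, "accountName");
```

Returning null for the empty case avoids the TypeError that indexing an empty collection and reading a property would raise.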
