Detecting multiple HTML tags with JavaScript and regex

I am building a Chrome extension that reads the current page and detects specific HTML/XML tags in it.

For example, suppose the current page contains the following tags and data:

some random text here and there

<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>state bank of america</accountName>
<accountHolder>rahul raina</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="-2044388005">
<description>Active Global Equities</description>
<value curCode="USD">159436.01</value>
</holding>
<holding holdingType="mutualFund" uniqueId="-556870249">
<description>Passive Non-US Equities</description> 
<value curCode="USD">72469.76</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
some data 123

<site name="McKinsey401k">
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>rahuk</accountName>
<accountHolder>rahuk</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="1285447255">
<description>Special Sits. Aggr. Long-Term</description>
<value curCode="USD">101944.69</value>
</holding>
<holding holdingType="mutualFund" uniqueId="1721876694">
<description>Special Situations Moderate $</description>
<value curCode="USD">49444.98</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>

So I need to identify, say, the <accountName> tag and print the text between its opening and closing tags, i.e. "state bank of america" and "rahuk".

This is what I have done so far:

function countString(document_r, a, b) {
    var test = document_r.body;
    var text = typeof test.textContent == 'string' ? test.textContent : test.innerText;
    var testRE = text.match(a + "(.*)" + b);
    return testRE[1];
}



chrome.extension.sendMessage({
    action: "getSource",
    source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document,'<accountName>','</accountName>')
});

But this prints only the inner text of the first matching tag on the page, i.e. "state bank of america".

What if I want to print only "rahuk", which is the inner text of the last such tag on the page, or both?

How do I print the inner text of the last matching tag on the page, or of all the matching tags?

Thanks in advance.

EDIT: The document above is itself an HTML page; I have just pasted the contents of the page.

UPDATE: Following the suggestions below, I experimented a bit, and the best I could get to is this code:

function countString(document_r) {
    var test = document_r.body;
    var text = test.innerText;

    var tag = "accountName";
    var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
    var regexg = new RegExp(regex, "g");
    var testRE = text.match(regexg);
    return testRE;
}

chrome.extension.sendMessage({
    action: "getSource",
    source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document)
});

But this gave me :

XML DETAILS>>>>> Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP)),Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP)),Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP))

This is because the same XML was present in the page 3 times. What I want is for the regex to match only the last XML block, and I don't want the tag names either.

So my desired output would be:

XML DETAILS>>>>> Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP))

Regex pattern like this: <accountName>(.*?)<\\/accountName>

var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var testRE = text.match(regex);

=> testRE holds the match result; with tag = "accountName" the captured text is in testRE[1]. Note that without the "g" flag only the first occurrence ("state bank of america") is matched; see the update below.

UPDATE

According to this page, to receive all matches instead of only the first one, you must add a "g" flag to the match pattern.

"g: The global search flag makes the RegExp search for a pattern throughout the string, creating an array of all occurrences it can find matching the given pattern." found here

Hope this helps you!
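
To make the effect of the flag concrete, here is a small sketch. It assumes a shortened sample string in place of the real page text. Note that with "g", match() returns the whole matches (tags included) and drops the capture groups, which would explain why the tag names can show up in the output.

```javascript
// Shortened sample text (an assumption, standing in for the page's text content).
var text = "<accountName>state bank of america</accountName> some data " +
           "<accountName>rahuk</accountName>";

// Without "g": only the first occurrence, but the capture group is available.
var first = text.match(/<accountName>(.*?)<\/accountName>/);
console.log(first[1]);   // "state bank of america"

// With "g": every occurrence, but only the full matches, no capture groups.
var all = text.match(/<accountName>(.*?)<\/accountName>/g);
console.log(all.length); // 2
console.log(all[1]);     // "<accountName>rahuk</accountName>"
```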

Your match call is not global.

var regex = new RegExp(a+"(.*)"+b, "g");
text.match(regex);
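
If only the text between the tags is wanted, one possible approach is to loop with exec() and collect the capture group; the last array entry is then the last occurrence. This is a sketch, not a tested extension snippet: extractTagTexts is a hypothetical helper name, and the sample string is a shortened stand-in for the page text.

```javascript
// Hypothetical helper: inner text of every <tag>...</tag> pair found in `text`.
function extractTagTexts(text, tag) {
    // Escape regex metacharacters so the tag name is matched literally.
    var safeTag = tag.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // [\s\S]*? is lazy and, unlike (.*), also spans line breaks.
    var re = new RegExp("<" + safeTag + ">([\\s\\S]*?)</" + safeTag + ">", "g");
    var results = [];
    var m;
    // With the "g" flag, each exec() call resumes after the previous match.
    while ((m = re.exec(text)) !== null) {
        results.push(m[1]); // capture group only, so no tag names
    }
    return results;
}

var sample = "x <accountName>state bank of america</accountName> y " +
             "<accountName>rahuk</accountName>";
var names = extractTagTexts(sample, "accountName");
var lastName = names.length ? names[names.length - 1] : null;
console.log(names.join(", ")); // "state bank of america, rahuk"
console.log(lastName);         // "rahuk"
```

In the extension, the first argument would be the page text (for example document.body.textContent, as in the question).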

If the full XML string is valid, you can parse it into an XML document using the DOMParser.parseFromString method:

var xmlString = '<root>[Valid XML string]</root>';
var parser = new DOMParser();
var doc = parser.parseFromString(xmlString, 'text/xml');

Then you can get a list of tags with a specified name directly:

var found = doc.getElementsByTagName('tagName');

Here's a jsFiddle example using the XML you provided, with two minor tweaks: I had to add a root element and an opening tag for the first site.
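
Putting the two steps together, reading only the last matching element might look like the sketch below. lastTagText is a hypothetical helper, the XML fragment is a shortened assumption based on the question's data, and the DOMParser part only runs where DOMParser exists (a browser or a content script, not plain Node.js).

```javascript
// Hypothetical helper: text of the last <tagName> element in a parsed document, or null.
function lastTagText(doc, tagName) {
    var found = doc.getElementsByTagName(tagName);
    return found.length ? found[found.length - 1].textContent : null;
}

// Browser-only part: DOMParser is not defined in plain Node.js.
if (typeof DOMParser !== "undefined") {
    var fragment = "<site><accountName>state bank of america</accountName></site>" +
                   "<site><accountName>rahuk</accountName></site>";
    // Wrap the fragments in a single root element so the string is well-formed XML.
    var xmlDoc = new DOMParser().parseFromString("<root>" + fragment + "</root>", "text/xml");
    // On a parse failure the returned document contains a <parsererror> element.
    if (xmlDoc.getElementsByTagName("parsererror").length === 0) {
        console.log(lastTagText(xmlDoc, "accountName"));
    }
}
```

Using textContent returns just the text between the tags, so no tag names need to be stripped afterwards.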

You don't need regular expressions for this task (besides, read "RegEx match open tags except XHTML self-contained tags" for why it's not a good idea!). You can do this entirely in JavaScript with the DOM API:

var tag = "section";
var targets = document.getElementsByTagName(tag);
// Walk from the last match to the first; valid indexes are 0 .. length - 1.
for (var i = targets.length - 1; i >= 0; i--) {
    console.log(targets[i].innerText);
}
