Detecting multiple HTML tags with JavaScript and regex
I am building a Chrome extension that reads the current page and detects specific HTML/XML tags in it.
For example, suppose the current page contains the following tags and data:
some random text here and there
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>state bank of america</accountName>
<accountHolder>rahul raina</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="-2044388005">
<description>Active Global Equities</description>
<value curCode="USD">159436.01</value>
</holding>
<holding holdingType="mutualFund" uniqueId="-556870249">
<description>Passive Non-US Equities</description>
<value curCode="USD">72469.76</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
some data 123
<site name="McKinsey401k">
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>rahuk</accountName>
<accountHolder>rahuk</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="1285447255">
<description>Special Sits. Aggr. Long-Term</description>
<value curCode="USD">101944.69</value>
</holding>
<holding holdingType="mutualFund" uniqueId="1721876694">
<description>Special Situations Moderate $</description>
<value curCode="USD">49444.98</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
So I need to identify, say, the <accountName> tag and print the text between its opening and closing tags, i.e. "state bank of america" and "rahuk".
This is what I have done so far:
function countString(document_r,a,b) {
var test = document_r.body;
var text = typeof test.textContent == 'string'? test.textContent : test.innerText;
var testRE = text.match(a+"(.*)"+b);
return testRE[1];
}
chrome.extension.sendMessage({
action: "getSource",
source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document,'<accountName>','</accountName>')
});
But this prints the inner text of only the first matching tag it encounters on the page, i.e. "state bank of america".
What if I want to print only "rahuk", which is the inner text of the last such tag on the page, or both values?
How do I print the inner text of the last matching tag on the page, or of all matching tags?
Thanks in advance.
EDIT: The document above is itself an HTML page; I have just pasted the contents of the page.
UPDATE: I tried a few of the suggestions below, and this is the best I could come up with:
function countString(document_r) {
var test = document_r.body;
var text = test.innerText;
var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var regexg = new RegExp(regex,"g");
var testRE = text.match(regexg);
return testRE;
}
chrome.extension.sendMessage({
action: "getSource",
source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document)
});
But this gave me:
XML DETAILS>>>>> Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP)),Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP)),Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP))
This happened because the same XML was present in the page 3 times. What I want is for the regex to match only the last XML block, and I don't want the tag names in the output either.
So my desired output would be:
XML DETAILS>>>>> Retirement Program (Profit-Sharing Retirement Plan (PSRP) and Money Purchase Pension Plan (MPPP))
Use a regex pattern like this: <accountName>(.*?)<\\/accountName>
var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var testRE = text.match(regex);
=> testRE should contain all your matches; with tag = "accountName", that is "state bank of america" and "rahuk".
UPDATE
According to this page, to receive all matches instead of only the first one, you must add a "g" flag to the match pattern.
"g: The global search flag makes the RegExp search for a pattern throughout the string, creating an array of all occurrences it can find matching the given pattern."
Hope this helps you!
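To make the difference concrete, here is a small sketch of how String.prototype.match behaves with and without the "g" flag. The sample string is a shortened stand-in for the page text (an assumption for illustration, not the asker's real page).

```javascript
// Shortened stand-in for the page text (an assumption for illustration).
var text = "<accountName>state bank of america</accountName> some data 123 " +
           "<accountName>rahuk</accountName>";

var pattern = "<accountName>(.*?)<\\/accountName>";

// Without "g": only the first match, with the capture group at index 1.
var first = text.match(new RegExp(pattern));
console.log(first[1]); // state bank of america

// With "g": every match, but as full matched strings (tags included);
// the capture groups are not returned.
var all = text.match(new RegExp(pattern, "g"));
console.log(all.length); // 2
console.log(all[1]);     // <accountName>rahuk</accountName>
```

Note that the global form returns the whole matched substring, so the tag names still have to be stripped in a separate step if only the inner text is wanted.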
Your match is not global. Create the regex with the "g" flag:
var regex = new RegExp(a+"(.*)"+b, "g");
text.match(regex);
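Since the question asks for the inner text without the tag names, and for the last occurrence in particular, one possible sketch is to loop with RegExp.prototype.exec and collect only the capture group. The helper name innerTexts and the sample string are hypothetical, and the tag name is assumed to be a plain identifier with no regex metacharacters.

```javascript
// Hypothetical helper: returns the inner text of every <tag>...</tag> pair.
// Assumes `tag` is a plain name with no regex metacharacters.
function innerTexts(text, tag) {
  // [\s\S]*? instead of .*? so a match may span line breaks.
  var regex = new RegExp("<" + tag + ">([\\s\\S]*?)<\\/" + tag + ">", "g");
  var results = [];
  var m;
  // With the "g" flag, exec() resumes after the previous match and
  // returns null once there are no more matches.
  while ((m = regex.exec(text)) !== null) {
    results.push(m[1]); // capture group only, tags excluded
  }
  return results;
}

// Shortened stand-in for the page text (an assumption for illustration).
var sample = "<accountName>state bank of america</accountName>\nsome data 123\n" +
             "<accountName>rahuk</accountName>";
var names = innerTexts(sample, "accountName");
console.log(names);                   // all inner texts
console.log(names[names.length - 1]); // the last one
```

Taking the last array element covers the "only the last XML" case; if the tag is absent the array is empty, so the caller should check its length before indexing.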
If the full XML string is valid, you can parse it into an XML document using the DOMParser.parseFromString method:
var xmlString = '<root>[Valid XML string]</root>';
var parser = new DOMParser();
var doc = parser.parseFromString(xmlString, 'text/xml');
Then you can get a list of tags with a specified name directly:
var found = doc.getElementsByTagName('tagName');
Here's a jsFiddle example using the XML you provided, with two minor tweaks: I had to add a root element and an opening tag for the first site.
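As a rough sketch of those two steps together (this is not the code from the jsFiddle), the fragment can be wrapped in a root element, parsed, and queried by tag name. The helper textsByTagName and the shortened fragment are hypothetical; DOMParser exists only in browsers, so that part is guarded.

```javascript
// Hypothetical helper: text content of every element named tagName in doc.
function textsByTagName(doc, tagName) {
  var nodes = doc.getElementsByTagName(tagName);
  var texts = [];
  for (var i = 0; i < nodes.length; i++) {
    texts.push(nodes[i].textContent);
  }
  return texts;
}

// Browser-only part: DOMParser is not defined in Node.js.
if (typeof DOMParser !== "undefined") {
  // Shortened stand-in for the page content (an assumption for illustration).
  var fragment = '<site name="a"><accountName>state bank of america</accountName></site>' +
                 'some data 123' +
                 '<site name="b"><accountName>rahuk</accountName></site>';
  // Wrap in a single root element so the string is well-formed XML.
  var xmlDoc = new DOMParser().parseFromString("<root>" + fragment + "</root>", "text/xml");
  // On a parse error, browsers return a document containing <parsererror>.
  if (xmlDoc.getElementsByTagName("parsererror").length === 0) {
    var accountNames = textsByTagName(xmlDoc, "accountName");
    console.log(accountNames[accountNames.length - 1]); // text of the last <accountName>
  }
}
```

Checking for a parsererror element matters here because the page text mixes XML with stray prose, and an unbalanced tag would otherwise yield an error document rather than the expected elements.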
You don't need regular expressions for this task (besides, read "RegEx match open tags except XHTML self-contained tags" for why it's not a good idea!). You can do this entirely with JavaScript DOM methods:
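var tag = "section";
var targets = document.getElementsByTagName(tag);
// Walk the collection from the last element to the first.
// Valid indexes run from 0 to targets.length - 1.
for (var i = targets.length - 1; i >= 0; i--) {
  console.log(targets[i].innerText);
}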