
Filtering <form> from HTML text using a regular expression

I am getting a whole HTML page from an Ajax request as text (xmlhttp.responseText).

I then filter that text to extract an HTML form and everything inside it.

I wrote a regex:

text.match(/(<form[\W\w]*<\/form>)/gim)

As I am not an expert in regex, I can't be sure whether it will work in every scenario and capture everything inside the form tag.

Is there a better way to say "everything" in regex, so that the regex looks like this?

text.match(/(<form[__everything_syntax_here__]*<\/form>)/gim)

Try this:

// Returns the HTML in s with every <form> element removed.
function stripForm(s) {
  var div = document.createElement('div');
  div.innerHTML = s; // let the browser parse the markup
  var forms = div.getElementsByTagName('form');
  var i = forms.length;
  // The collection is live, so remove elements from the end backwards.
  while (i--) {
    forms[i].parentNode.removeChild(forms[i]);
  }
  return div.innerHTML;
}

// Returns the inner HTML of every <form> in s, concatenated in document order.
function getForm(s) {
  var div = document.createElement('div');
  div.innerHTML = s;
  var forms = div.getElementsByTagName('form');
  var ret = '';
  for (var i = 0; i < forms.length; i++) {
    ret += forms[i].innerHTML; // contents only, without the <form> tag itself
  }
  return ret;
}

var a = 'before Form <form action="" method="post"> <input type="text" /> <input type="text" /> <input type="text" /> </form><br/> after form';
alert(getForm(a));
alert(stripForm(a));
console.log(stripForm(a));


Having to deal with IE 5, you poor soul.

A quick answer to your question, "Is [\W\w] really the best way to match absolutely everything?"

Yes. JavaScript engines of that era do not support the s modifier that makes . match newlines (the s flag was only added in ES2018). [\W\w] basically tells the regex: "Match anything that is a word character, or anything that isn't a word character", and every character falls into one of those two categories.
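A small sketch of why the character class is needed, and of one caveat with the pattern in the question: `*` is greedy, so with more than one form on the page the match runs from the first `<form` to the last `</form>`. The sample string and variable names are made up for illustration.

```javascript
// Hypothetical input: two forms, the first one containing newlines.
var html = 'a <form id="one">\n<input type="text" />\n</form> b ' +
  '<form id="two"><input /></form> c';

// "." does not match a newline, so a dot-based pattern misses the first form.
var dotMatches = html.match(/<form.*?<\/form>/gi);

// Greedy [\w\W]* runs from the first <form to the LAST </form>: one big match.
var greedyMatches = html.match(/<form[\w\W]*<\/form>/gi);

// Lazy [\w\W]*? stops at the nearest </form>: one match per form.
var lazyMatches = html.match(/<form[\w\W]*?<\/form>/gi);

console.log(dotMatches.length, greedyMatches.length, lazyMatches.length);
```

The `m` flag in the question's pattern only changes how `^` and `$` behave, so it has no effect there and is left out here.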

But if you want a more reliable solution that copes with <!-- html comments --> and multiple forms on a page, the best approach is something like the one explained in this SO answer, adapted for HTML.

This is what I would use:

<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)


Look at the Debuggex Demo to see what matches you actually get. In JavaScript you can then inspect the first capture group. If it's empty, that match was only there to consume a commented-out form, as explained here.
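One way this could look in JavaScript, as a sketch rather than a drop-in implementation: the pattern is written as a regex literal with every `/` escaped, and `exec` is called in a loop so each match's first capture group can be checked. `FORM_RE`, `extractForms` and the sample string are hypothetical names used only for this illustration.

```javascript
// The pattern from this answer as a JavaScript literal with the g and i flags.
// Alternative 1 consumes an HTML comment; alternative 2 captures a form (group 1).
var FORM_RE = /<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)/gi;

// Hypothetical helper: returns every form that is not inside an HTML comment.
function extractForms(html) {
  var forms = [];
  var match;
  FORM_RE.lastIndex = 0; // a global regex remembers its position between calls
  while ((match = FORM_RE.exec(html)) !== null) {
    // Group 1 is empty when the match was only a comment; skip those.
    if (match[1]) {
      forms.push(match[1]);
    }
  }
  return forms;
}

// Made-up page: a commented-out form, a form containing a comment, a multi-line form.
var page = '<!-- <form id="old"></form> -->' +
  '<form id="a"><!-- </form> --><input /></form>' +
  '<p>text</p><form id="b">\n<input />\n</form>';

console.log(extractForms(page));
```

The commented-out form is matched by the first alternative and skipped, while the `</form>` inside a comment in form `a` does not end that form early.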
