
Javascript regex whitespace is being wacky

I'm trying to write a regex that searches a page for any script tags and extracts the script content, and in order to accommodate any HTML-writing style, I want my regex to include script tags with any arbitrary number of whitespace characters (eg <script type = blahblah> and <script type=blahblah> should both be found). My first attempt ended up with funky results, so I broke down the problem into something simpler, and decided to just test and play around with a regex like /\s*h\s*/g.

When testing it out on strings, for some reason completely arbitrary amounts of whitespace around the 'h' would match and other arbitrary amounts wouldn't, eg " h " with one number of spaces would match but the same string with a different number of spaces wouldn't. Does anyone have an idea of why this is occurring, or what error I'm making?

Since you're using JavaScript, why can't you just use getElementsByTagName('script') ? That's how you should be doing it.

If you somehow have an HTML string, create an iframe and dump the HTML into it, then run getElementsByTagName('script') on it.

OK, to extend Kolink's answer, you don't need an iframe, or event handlers:

var temp = document.createElement('div');
temp.innerHTML = otherHtml;
var scripts = temp.getElementsByTagName('script');

... now scripts is a DOM collection of the script elements - and the script doesn't get executed ...
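As a rough sketch of how that could be turned into "give me the script contents", assuming a browser environment: the function name extractScripts and its parameters doc (expected to be the page's document) and wantedType (an optional type attribute to filter on) are made up for this example.

```javascript
// Sketch only: collect the text of each <script> found in an HTML string.
// `doc` is assumed to be the browser's `document`; `wantedType` is a
// hypothetical optional filter on the type attribute.
function extractScripts(html, doc, wantedType) {
  var temp = doc.createElement('div');
  temp.innerHTML = html; // scripts inserted this way are not executed
  var scripts = temp.getElementsByTagName('script');
  var contents = [];
  for (var i = 0; i < scripts.length; i++) {
    var type = scripts[i].getAttribute('type'); // null when the attribute is absent
    if (wantedType === undefined || type === wantedType) {
      contents.push(scripts[i].textContent);
    }
  }
  return contents;
}
```

In a page you would call it as extractScripts(otherHtml, document, 'blahblah'). The HTML parser deals with whitespace and quoting around the attribute, so <script type = blahblah> and <script type="blahblah"> are treated the same.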


Why regex is not a fantastic idea for this:

As a <script> element cannot contain the string </script> anywhere in its content, writing a regex to match them would not be difficult: /<script[\s\S]+?<\/script>/gi
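A quick sketch of that basic match, on a made-up input string (sampleHtml and findScriptBlocks are names invented for this example). It uses [\s\S] for "any character including newlines", because inside a character class a . only matches a literal dot.

```javascript
// Made-up input, for illustration only.
var sampleHtml = '<p>hi</p>' +
  '<script type="a">var x = 1;\nvar y = 2;</script>' +
  '<SCRIPT>z()</SCRIPT>';

// Return every <script ...>...</script> block found in the string.
// The lazy quantifier stops at the first closing tag; the i flag
// makes the tag name case-insensitive.
function findScriptBlocks(html) {
  return html.match(/<script[\s\S]+?<\/script>/gi) || [];
}

var blocks = findScriptBlocks(sampleHtml);
console.log(blocks.length); // 2
console.log(blocks[1]);     // <SCRIPT>z()</SCRIPT>
```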

It looks like you want to only match scripts with a specific type attribute. You could try to include that in your pattern too: /<script[^>]+type\s*=\s*(["']?)blahblah\1[\s\S]*?<\/script>/gi - but that is horrible. (That's what happens when you use regular expressions on irregular strings, you need to simplify)

So instead you iterate through all the basic matched scripts, extract the starting tag: result.match(/<script[^>]*>/i)[0] and within that, search for your type attribute /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/.test(startTag) . Oh look - it's back to horrible - simplify!
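Put together, that two-step approach might look roughly like this. It is a sketch with the same caveats as the rest: scriptsWithType is a made-up name, blahblah stands in for whatever type you are after, and it breaks as soon as an attribute value contains a > character.

```javascript
// Sketch: keep only the script blocks whose start tag has type=blahblah,
// quoted or unquoted, with any whitespace around the equals sign.
function scriptsWithType(html) {
  var found = html.match(/<script[\s\S]+?<\/script>/gi) || [];
  var typePattern = /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/;
  return found.filter(function (result) {
    var tagMatch = result.match(/<script[^>]*>/i);
    if (!tagMatch) return false; // defensive; should not happen for these blocks
    var startTag = tagMatch[0];
    return typePattern.test(startTag);
  });
}

var pageHtml = '<script type = blahblah>a()</script>' +
  '<script type="other">b()</script>' +
  "<script type='blahblah'>c()</script>";
console.log(scriptsWithType(pageHtml).length); // 2
```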

This time via normalisation: startTag = startTag.replace(/\s*=\s*/g, '=').replace(/=([^\s"'>]+)/g, '="$1"') - now you're in danger territory, what if the = is inside a quoted string? Can you see how it just gets more and more complicated?
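To make the danger concrete, here is the normalisation wrapped in a small function (normaliseStartTag is a made-up name) and run on two made-up start tags, one harmless and one with an = inside a quoted value:

```javascript
// Sketch of the normalisation step: strip whitespace around "=" and
// wrap unquoted attribute values in double quotes.
function normaliseStartTag(startTag) {
  return startTag
    .replace(/\s*=\s*/g, '=')
    .replace(/=([^\s"'>]+)/g, '="$1"');
}

// The simple case works as intended.
console.log(normaliseStartTag('<script type = blahblah>'));
// <script type="blahblah">

// An "=" inside a quoted value gets rewritten too, corrupting the tag.
console.log(normaliseStartTag('<script data-x="a = b">'));
// <script data-x="a="b"">
```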

You can only have this work using regex if you make robust assumptions about the HTML you'll use it on (ie to make it regular). Otherwise your problems will grow and grow and grow!

  • disclaimer: I haven't tested any of the regex used to see if they do what I say they do, they're just example attempts.


 