
JavaScript regex whitespace is being wacky

I'm trying to write a regex that searches a page for any script tags and extracts the script content. To accommodate any HTML-writing style, I want the regex to match script tags with an arbitrary amount of whitespace (e.g. <script type = blahblah> and <script type=blahblah> should both be found). My first attempt gave funky results, so I broke the problem down into something simpler and decided to just test and play around with a regex like /\s*h\s*/g.

When testing it on a string, for some reason completely arbitrary amounts of whitespace around the 'h' would match and other arbitrary amounts wouldn't, e.g. something like " h " would match but " h " wouldn't. Does anyone have an idea of why this is occurring, or what error I'm making?

Since you're using JavaScript, why can't you just use getElementsByTagName('script')? That's how you should be doing it.

If you somehow have an HTML string, create an iframe and dump the HTML into it, then run getElementsByTagName('script') on it.

OK, to extend Kolink's answer, you don't need an iframe or event handlers:

var temp = document.createElement('div');
temp.innerHTML = otherHtml;
var scripts = temp.getElementsByTagName('script');

... now scripts is a DOM collection of the script elements - and the script doesn't get executed ...


Why regex is not a fantastic idea for this:

As a <script> element may not contain the string </script> anywhere, writing a regex to match them would not be difficult: /<script[\s\S]+?<\/script>/gi
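One pitfall when writing a pattern like this: inside a character class the dot is a literal ".", so a class such as [.\n] matches only dots and newlines, whereas [\s\S] matches any character including newlines. A minimal sketch of the difference, assuming a made-up HTML string (the variable names are illustrative, not from the post):

```javascript
// Hypothetical sample input, only here to exercise the two patterns.
const html = [
  '<p>intro</p>',
  '<script type="text/javascript">',
  '  var a = 1;',
  '</script>',
  '<script>var b = 2;</script>'
].join('\n');

// Inside [...] the dot is literal, so [.\n] only matches "." or a newline.
const dotClass = /<script[.\n]+?<\/script>/gi;

// [\s\S] matches any character, newlines included.
const anyChar = /<script[\s\S]+?<\/script>/gi;

// String.prototype.match with the g flag returns null when nothing matches.
const dotClassMatches = html.match(dotClass);
const anyCharMatches = html.match(anyChar);

console.log(dotClassMatches);        // null
console.log(anyCharMatches.length);  // 2
```

This only shows that the basic match can be written; it says nothing yet about attributes, which is where things get ugly.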

It looks like you only want to match scripts with a specific type attribute. You could try to include that in your pattern too: /<script[^>]+type\s*=\s*(["']?)blahblah\1[\s\S]*?<\/script>/gi - but that is horrible. (That's what happens when you use regular expressions on irregular strings; you need to simplify.)

So instead you iterate through all the basic matched scripts, extract the starting tag: result.match(/<script[^>]*>/i)[0] and within that, search for your type attribute: /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/.test(startTag). Oh look - it's back to horrible - simplify!
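Put together, the two-step approach might look roughly like the sketch below. The input string and variable names are hypothetical, "blahblah" stands in for the real type value as in the question, and the outer pattern uses [\s\S] to span newlines (an assumption on my part). It is an illustration of the idea, not a robust extractor.

```javascript
// Hypothetical input covering quoted, unquoted, other and missing type attributes.
const page = [
  '<script type = "blahblah">one();</script>',
  "<script type='other'>two();</script>",
  '<script type=blahblah>three();</script>',
  '<script>four();</script>'
].join('\n');

const scriptPattern = /<script[\s\S]+?<\/script>/gi;  // step 1: every script block
const startTagPattern = /<script[^>]*>/i;             // step 2: just the start tag
// No g flag here, so test() keeps no state between calls.
const typePattern = /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/i;

const wanted = [];
for (const block of page.match(scriptPattern) || []) {
  const startTag = block.match(startTagPattern)[0];
  if (typePattern.test(startTag)) {
    // Drop the start tag and the trailing </script> to keep only the content.
    const content = block.slice(startTag.length).replace(/<\/script>$/i, '');
    wanted.push(content);
  }
}

console.log(wanted); // [ 'one();', 'three();' ]
```

Even this version is easy to fool, for example with an attribute whose name merely ends in "type", which is the point being made here.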

This time via normalisation: startTag = startTag.replace(/\s*=\s*/g, '=').replace(/=([^\s"'>]+)/g, '="$1"') - now you're in dangerous territory: what if the = is inside a quoted string? Can you see how it just gets more and more complicated?
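To make the danger concrete, here is a small sketch of that normalisation applied to a well-behaved tag and to one with an = inside a quoted value. The function name and the data-x attribute are made up for illustration; the two replace() calls are the ones above.

```javascript
// Hypothetical helper wrapping the two replace() calls from the answer.
function normaliseStartTag(startTag) {
  return startTag
    .replace(/\s*=\s*/g, '=')            // drop whitespace around every "="
    .replace(/=([^\s"'>]+)/g, '="$1"');  // wrap unquoted values in double quotes
}

// The easy case behaves as intended.
const simple = normaliseStartTag('<script type = blahblah>');
console.log(simple); // <script type="blahblah">

// An "=" inside a quoted value is rewritten too, corrupting the attribute.
const quotedEquals = normaliseStartTag('<script type="blahblah" data-x="a = b">');
console.log(quotedEquals); // <script type="blahblah" data-x="a="b"">
```

The second result is no longer valid markup, so every later check on the start tag is working with damaged input.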

You can only make this work with regex if you make strong assumptions about the HTML you'll use it on (i.e. assumptions that make it regular). Otherwise your problems will grow and grow and grow!

  • disclaimer: I haven't tested any of the regexes used to see if they do what I say they do; they're just example attempts.
