Remove on* JS event attributes from HTML tags

Question

Please help me parse simple HTML strings in PHP (with PHP regular expressions). I need to strip the HTML JS event attributes from HTML code. I am not good with PHP regular expressions.

Examples of code:

<button onclick="..javascript instruction..">

Result: <button>

<button onclick="..javascript instruction.." value="..">

Result: <button value="..">

<button onclick=..javascript instruction..>

Result: <button>

<button onclick=..javascript instruction.. value>

Result: <button value>

I need this to work both with and without quotes, because all modern browsers allow attribute values without quotes.

Note: I need to handle not only onclick, but all attributes that begin with 'on'.

Note (2): Please don't suggest an HTML parser, because the DOM tree to parse would be very big.

UPDATED: Thanks for your replies! I now use the HTMLPurifier component in a small framework that I wrote.

Answer 1

There is nothing wrong with tokenizing with regex. But making a full HTML tokenizer with regex is a lot of work and difficult to get right. I would recommend using a proper parser, because you will probably need to remove script tags and such anyway.

Assuming a full tokenizer is not needed, the following regex and code can be used to remove on* attributes from HTML tags. Because a proper tokenizer is not used, it will also match strings that merely look like tags inside scripts, comments, CDATA, etc.

There is no guarantee that all event attributes will be removed for all input/browser combinations! See the notes below.

Note on error tolerance:

Browsers are usually forgiving of errors. Because of that, it is difficult to tokenize tags and extract the attributes the way the browser would see them when "invalid" data is present. Because error tolerance and handling differ between browsers, it is impossible to make a solution that works for all of them in all cases.

Thus: some browser (a current, past, or future version) could treat something that my code does not consider a tag as a tag, and execute the JS code.

In my code I have attempted to mimic the tag tokenization (and error tolerance/handling) of recent Google Chrome versions. Firefox seems to do it in a similar way.

IE 7 differs; in some cases it is not as tolerant (which is better than being more tolerant). (IE 6: let's not go there. See the XSS Filter Evasion Cheat Sheet.)

The code

$redefs = '(?(DEFINE)
    (?<tagname> [a-z][^\s>/]*+    )
    (?<attname> [^\s>/][^\s=>/]*+    )  # first char can be pretty much anything, including =
    (?<attval>  (?>
                    "[^"]*+" |
                    \'[^\']*+\' |
                    [^\s>]*+            # unquoted values can contain quotes, = and /
                )
    ) 
    (?<attrib>  (?&attname)
                (?: \s*+
                    = \s*+
                    (?&attval)
                )?+
    )
    (?<crap>    [^\s>]    )             # most crap inside tag is ignored, will eat the last / in self closing tags
    (?<tag>     <(?&tagname)
                (?: \s*+                # spaces between attributes not required: <b/foo=">"style=color:red>bold red text</b>
                    (?>
                        (?&attrib) |    # order matters
                        (?&crap)        # if not an attribute, eat the crap
                    )
                )*+
                \s*+ /?+
                \s*+ >
    )
)';


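// removes on* attributes from all matched HTML tags
// (uses preg_replace_callback because the /e modifier was removed in PHP 7)
function remove_event_attributes($html){
    global $redefs;
    $re = '(?&tag)' . $redefs;
    return preg_replace_callback("~$re~xi", function ($m) {
        return remove_event_attributes_from_tag($m[0]);
    }, $html);
}

// removes on* attributes from a single opening tag
function remove_event_attributes_from_tag($tag){
    global $redefs;
    $re = '( ^ <(?&tagname) ) | \G \s*+ (?> ((?&attrib)) | ((?&crap)) )' . $redefs;
    return preg_replace_callback("~$re~xi", function ($m) {
        // group 1 = tag start, group 2 = attribute, group 3 = stray character
        $attrib = isset($m[2]) ? $m[2] : '';
        // drop attributes whose name starts with "on" (leaving one space); keep everything else
        return preg_match('/^on/i', $attrib) ? ' ' : $m[0];
    }, $tag);
}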

Example usage

$str = '
<button onclick="..javascript instruction..">
<button onclick="..javascript instruction.." value="..">
<button onclick=..javascript_instruction..>
<button onclick=..javascript_instruction.. value>
<hello word "" ontest = "hai"x="y"onfoo=bar/baz  />
';

echo $str . "\n----------------------\n";

echo remove_event_attributes($str);

Output:

<button onclick="..javascript instruction..">
<button onclick="..javascript instruction.." value="..">
<button onclick=..javascript_instruction..>
<button onclick=..javascript_instruction.. value>
<hello word "" ontest = "hai"x="y"onfoo=bar/baz  />

----------------------

<button >
<button  value="..">
<button >
<button  value>
<hello word "" x="y"   />

Answer 2

You might be better off using DOMDocument.

You can use it to iterate over the DOM tree represented by the HTML file you're trying to parse, looking for the various on* attributes that you want to remove.

This approach is more likely to succeed because DOMDocument actually understands the semantics of an HTML file, whereas regex is just a dumb string matcher and inadequate for reliably parsing HTML.
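As a rough illustration of the DOMDocument idea, here is a minimal sketch. The function name strip_event_attributes_dom and the wrapper id "wrap" are hypothetical names chosen for this example and are not part of the answer. It assumes the input is an HTML fragment in ASCII or Latin-1 (UTF-8 input would need extra encoding handling) and that PHP's bundled DOM extension is available.

```php
<?php
// Hypothetical helper (not from the original answer): removes every attribute
// whose name starts with "on" from an HTML fragment using the DOM extension.
function strip_event_attributes_dom(string $html): string
{
    $doc = new DOMDocument();

    // Silence libxml warnings about sloppy markup while parsing.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a known container so it can be extracted afterwards.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Collect all attribute nodes first, then remove the matching ones.
    foreach (iterator_to_array($xpath->query('//@*')) as $attr) {
        if (stripos($attr->nodeName, 'on') === 0) {
            $attr->ownerElement->removeAttribute($attr->nodeName);
        }
    }

    // Serialize only the children of the wrapper, not the added html/body.
    $wrapper = $xpath->query('//div[@id="wrap"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo strip_event_attributes_dom('<button onclick="alert(1)" value="x">ok</button>');
```

Note that the parser normalizes the markup (for example, it quotes unquoted attribute values), so the output is not guaranteed to be byte-identical to the input apart from the removed attributes. Like the regex approach, this only covers on* attributes; script tags and javascript: URLs would need separate handling.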

Answer 1: score 5, accepted, 2012-02-27 13:54:15

Answer 2: score 4, 2012-02-27 08:43:36
