
Which regular expression to use in order to determine which characters to escape for html attributes and javascript?

I am adapting some code from Twig (a PHP template framework) for escaping HTML and JS output. I don't entirely understand the regex they are using.

For the full Twig code:

git clone git://github.com/fabpot/Twig.git
// the code is in Core.php in the function twig_escape_filter

They use:

preg_replace_callback( '#[^a-zA-Z0-9,\._]#Su'   , '_twig_escape_js_callback'        , $string ); // for javascript
preg_replace_callback( '#[^a-zA-Z0-9,\.\-_]#Su' , '_twig_escape_html_attr_callback' , $string ); // for html attributes

The callback functions replace every character matched by the negated character class, that is, everything not listed inside the brackets.

As far as I can tell, this is equivalent (getting rid of some backslashes):

'#[^a-zA-Z0-9,._]#Su'
'#[^a-zA-Z0-9,._-]#Su'
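A quick way to check that claim is to run both spellings of each pattern over every ASCII character and compare the results. This is only a sketch: it covers the ASCII range, and the array keys and variable names are made up for illustration.

```php
<?php
// Each entry pairs the pattern as quoted from Twig with the simplified spelling.
$pairs = [
    'js'   => ['#[^a-zA-Z0-9,\._]#Su',   '#[^a-zA-Z0-9,._]#Su'],
    'attr' => ['#[^a-zA-Z0-9,\.\-_]#Su', '#[^a-zA-Z0-9,._-]#Su'],
];

$mismatches = [];
foreach ($pairs as $name => [$original, $simplified]) {
    // Only 0..127: with the u modifier a lone byte >= 128 is invalid UTF-8.
    for ($code = 0; $code < 128; $code++) {
        $char = chr($code);
        if (preg_match($original, $char) !== preg_match($simplified, $char)) {
            $mismatches[] = "$name: character code $code differs";
        }
    }
}

echo $mismatches === [] ? "equivalent on ASCII\n" : implode("\n", $mismatches) . "\n";
```

Inside a character class the dot is already literal, and a hyphen placed last cannot form a range, so neither backslash changes what is matched.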

Now we see that for JavaScript they allow commas, which I don't understand, because a comma has syntactic meaning in JavaScript (it separates function arguments, for instance). Take this example of a comma exploit:

// say we have a function call to a javascript function like this
function ajax( timeout, onerror, onsuccess ) {...};

// now assume I get the timeout value from somewhere dodgy (in php)
$timeout = escapeJS( '1000, evilCallback, evilCallback2' );

echo "ajax( $timeout, myErrorHandler, mySuccessHandler );";

Note that javascript will happily ignore the extra parameters...

For HTML attributes, the idea is to prevent the value from closing the attribute. That is why they don't allow spaces: it is common to write attributes without quotes, and in HTML4 that is legal. However, I see spaces used in attributes to give an element multiple classes, like <tr class="tablerow odd">. Disallowing spaces means a class attribute like this cannot come from a database, from templates, or from other sources.

  1. Given that in xhtml it is forbidden to use attributes without quotes and my site generates xhtml strict doctype, can I afford to allow spaces?
  2. Should I forbid the comma for javascript?

You should use htmlspecialchars for escaping HTML and json_encode for escaping Javascript.

$timeout = json_encode('1000, evilCallback, evilCallback2');
echo "ajax( $timeout, myErrorHandler, mySuccessHandler );";

Output:

ajax( "1000, evilCallback, evilCallback2", myErrorHandler, mySuccessHandler );

In your case you should also validate the actual content of the $timeout variable, or cast it to int like this:

$timeout = json_encode((int)'1000, evilCallback, evilCallback2');
echo "ajax( $timeout, myErrorHandler, mySuccessHandler );";

Output:

ajax( 1000, myErrorHandler, mySuccessHandler );

The json_encode call is not strictly needed when you cast to int, because a PHP integer prints as a valid JS numeric literal, but it is good practice to escape all your data for the appropriate context nevertheless.
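The cast above keeps the leading digits of a dodgy string, whereas validation rejects the whole value. Here is a minimal sketch of the validating variant; the helper name js_timeout and the fallback of 5000 are assumptions made for this example, not part of any library.

```php
<?php
// Hypothetical helper: returns a JS numeric literal for a timeout value.
// Anything that is not a clean non-negative integer falls back to $default.
function js_timeout($raw, int $default = 5000): string
{
    $value = filter_var($raw, FILTER_VALIDATE_INT, ['options' => ['min_range' => 0]]);
    if ($value === false) {
        $value = $default; // reject the input entirely instead of salvaging digits
    }
    return json_encode($value);
}

$timeout = js_timeout('1000, evilCallback, evilCallback2');
echo "ajax( $timeout, myErrorHandler, mySuccessHandler );\n";
// prints: ajax( 5000, myErrorHandler, mySuccessHandler );
```

Whether to fall back to a default or to fail loudly depends on the application; the point is only that the untrusted string never reaches the script unless it is exactly an integer.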


Update: Regarding the Twig code you're trying to adapt, it seems that it does not produce actual JavaScript literals, but escapes strings for inclusion in JavaScript string literals. This is apparent from its use of \xHH escape codes, which in JS are valid only inside strings (and regular expressions, but that's beside the point). It should be used like this:

$timeout = escapeJS('1000, evilCallback, evilCallback2');
echo "ajax('$timeout', myErrorHandler, mySuccessHandler);";

Notice the extra quotes around $timeout in the echo. This is likely done to allow composing longer JS strings from multiple escaped parts, like 'foo $escaped_part1 bar $escaped_part2 baz'.
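To make the difference concrete, here is a rough sketch. escape_js_sketch is a simplified stand-in written for this example: it reuses the character class quoted in the question, but its callback is an assumption and not Twig's actual one, and it handles ASCII input only.

```php
<?php
// Simplified stand-in for a JS string escaper (ASCII input only).
// Every character outside the allowed class becomes a \xHH escape.
function escape_js_sketch(string $value): string
{
    return preg_replace_callback(
        '#[^a-zA-Z0-9,\._]#Su',
        fn(array $m): string => sprintf('\\x%02X', ord($m[0])),
        $value
    );
}

// Commas are in the allowed class, so this input passes through unchanged.
$timeout = escape_js_sketch('1000,evilCallback,evilCallback2');

// Unquoted: the commas act as argument separators.
echo "ajax( $timeout, myErrorHandler, mySuccessHandler );\n";
// prints: ajax( 1000,evilCallback,evilCallback2, myErrorHandler, mySuccessHandler );

// Quoted: the same value is just one string argument.
echo "ajax('$timeout', myErrorHandler, mySuccessHandler);\n";
// prints: ajax('1000,evilCallback,evilCallback2', myErrorHandler, mySuccessHandler);
```

Inside the quotes the comma is harmless, and the quote character itself is escaped, so the value cannot end the string.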

Here is what I found in the OWASP XSS (Cross Site Scripting) Prevention Cheat Sheet:

For HTML attributes:

Properly quoted attributes can only be escaped with the corresponding quote. Unquoted attributes can be broken out of with many characters, including [space] % * + , - / ; < = > ^ and |.

Looking at it like that, I suppose there is no way both to be protected against unquoted attributes and to have spaces in your attributes. The escape function could add the quotes itself, but that would be inconsistent and would create situations where values get quoted twice, basically unquoting them. So, for now I have made two escaping functions, letting the user explicitly call the one that allows spaces, knowing that they must then put quotes around the attribute.
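For the variant that allows spaces, one possible shape is a helper that emits the attribute name and the quotes together, so a value is never quoted twice. This is only a sketch: quoted_attr is a hypothetical name, and the attribute name is assumed to be hard-coded by the template author, never user input.

```php
<?php
// Hypothetical helper: renders name="value" with the value escaped for a
// double-quoted attribute. $name must come from trusted template code.
function quoted_attr(string $name, string $value): string
{
    return sprintf('%s="%s"', $name, htmlspecialchars($value, ENT_QUOTES, 'UTF-8'));
}

echo '<tr ' . quoted_attr('class', 'tablerow odd') . ">\n";
// prints: <tr class="tablerow odd">

echo '<tr ' . quoted_attr('class', 'odd" onclick="alert(1)') . ">\n";
// prints: <tr class="odd&quot; onclick=&quot;alert(1)">
```

Because the quotes are always present, the space stays inside the value and only the quote character needs to be neutralised.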

Considering javascript:

Except for alphanumeric characters, escape all characters less than 256 with the \xHH format to prevent switching out of the data value into the script context or into another attribute. DO NOT use any escaping shortcuts like \" because the quote character may be matched by the HTML attribute parser which runs first. These escaping shortcuts are also susceptible to "escape-the-escape" attacks where the attacker sends \" and the vulnerable code turns that into \\" which enables the quote.

If an event handler is properly quoted, breaking out requires the corresponding quote. However, we have intentionally made this rule quite broad because event handler attributes are often left unquoted. Unquoted attributes can be broken out of with many characters including [space] % * + , - / ; < = > ^ and |. Also, a </script> closing tag will close a script block even though it is inside a quoted string because the HTML parser runs before the JavaScript parser.

This seems to indicate that we should escape everything. I have opted to keep the underscore, since it can be part of JavaScript names, and the dot, in order to allow inserting numerical values with a decimal point. I hope that leaves no vulnerabilities.
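The rule I settled on could look roughly like the sketch below. The function name escape_js_strict is my own, the input is assumed to be valid UTF-8, and reusing json_encode to produce \uXXXX escapes for multi-byte characters is an assumption of this sketch rather than something the cheat sheet prescribes.

```php
<?php
// Sketch of a stricter JS escaper: only letters, digits, dot and underscore
// pass through; the comma is escaped like everything else.
// Assumes $value is valid UTF-8 (otherwise the u modifier makes PCRE fail).
function escape_js_strict(string $value): string
{
    return preg_replace_callback(
        '#[^a-zA-Z0-9._]#Su',
        function (array $m): string {
            $char = $m[0];
            if (strlen($char) === 1) {
                // Single byte: \xHH as the cheat sheet recommends.
                return sprintf('\\x%02X', ord($char));
            }
            // Multi-byte: json_encode yields \uXXXX escapes; strip its quotes.
            return substr(json_encode($char), 1, -1);
        },
        $value
    );
}

echo escape_js_strict('1000, evilCallback'), "\n";
// prints: 1000\x2C\x20evilCallback

echo escape_js_strict('3.14'), "\n";
// prints: 3.14
```

With the comma escaped, a value dropped into an unquoted position can no longer add arguments, although the result is still meant to be used inside a string literal.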

I suppose the Twig code has a bug leaving that comma around and I will file a report so they can look into it.
