Remove Javascript From A URL

I'm writing a server-side script that replaces all URLs in a body of text with <a> tags (so they can be clicked).

How can I make sure that the URLs I convert do not contain any XSS-style JavaScript?

I'm currently filtering for "javascript:" in the string, but I suspect that is not sufficient.

Most modern server-side languages have an implementation of Markdown or another lightweight markup language, and these convert URLs into clickable links.

Unless you have a lot of time to research this topic and implement the script yourself, I'd suggest finding the best Markdown implementation for your language and either studying its code or simply using it in yours.

Markdown is usually shipped as a library; some implementations let you configure which elements they process and which they ignore. In your case you want to process URLs and ignore every other element.

Here's an (incomplete) list of solid Markdown implementations for different languages:

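You need to attribute-encode the URLs.
You should also make sure that they start with http:// or https://. Allowing only those two schemes rejects javascript:, vbscript: and similar URLs without having to list each one.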
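As a rough illustration of those two steps, here is a minimal PHP sketch. The function name safe_link is hypothetical and not part of any framework; it assumes the URL has already been extracted from the text and that the page is served as UTF-8. It is one possible approach, not a complete XSS filter.

```php
<?php
// Hypothetical helper (name chosen for this example): returns an <a> tag
// for $url, or null when the URL is rejected.
function safe_link(string $url): ?string
{
    $url = trim($url);

    // Allowlist: accept only absolute http:// or https:// URLs, and reject
    // anything containing whitespace or control characters.
    if (!preg_match('#^https?://[^\x00-\x20]+\z#i', $url)) {
        return null;
    }

    // Attribute-encode so quotes, ampersands and angle brackets cannot
    // break out of the href attribute or the link text.
    $encoded = htmlspecialchars($url, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    return '<a href="' . $encoded . '">' . $encoded . '</a>';
}
```

With this sketch, safe_link('javascript:alert(1)') returns null, while a URL such as http://example.com/?a=1&b=2 comes back as a link with the ampersand encoded as &amp;. Checking for an allowed prefix is generally easier to get right than searching for "javascript:", since a forbidden scheme can be written in mixed case or with embedded whitespace.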

This was taken from the Kohana framework's XSS filtering code. It is not a complete answer, but it might get you on your way.

// Remove javascript: and vbscript: protocols
$str = preg_replace('#([a-z]*)[\x00-\x20]*=[\x00-\x20]*([`\'"]*)[\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iu', '$1=$2nojavascript...', $str);
$str = preg_replace('#([a-z]*)[\x00-\x20]*=([\'"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iu', '$1=$2novbscript...', $str);
$str = preg_replace('#([a-z]*)[\x00-\x20]*=([\'"]*)[\x00-\x20]*-moz-binding[\x00-\x20]*:#u', '$1=$2nomozbinding...', $str);

// Only works in IE: <span style="width: expression(alert('Ping!'));"></span>
$str = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?expression[\x00-\x20]*\([^>]*+>#is', '$1>', $str);
$str = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?behaviour[\x00-\x20]*\([^>]*+>#is', '$1>', $str);
$str = preg_replace('#(<[^>]+?)style[\x00-\x20]*=[\x00-\x20]*[`\'"]*.*?s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:*[^>]*+>#ius', '$1>', $str);
