
How do I catch JavaScript code injection in a URL using Python?

How do I detect malformed URLs, or URLs with JavaScript injected into them? For example:

'http://example.com/portal/image/user_male_portrait?img_id=755109&t=1372243875358"><script>setTimeout(function () {document.body.innerHTML = \'<img src="http://images.example.com:9191/public/rickroll.gif" style="display: block; width: 100%">\'; }, 100);</script><!--'

'http://example.com/portal/image/user_male_portrait?img_id=566203&t=1350313911834'

The first URL is malicious while the second one is not. I want to be able to flag the first one. I suppose I could use a regex to match script tags, but is that the right way to do it in Python?

It would be really hard to write a regular expression that knows whether a URL is an attempt at script injection or not. To match the example you gave, searching for <script would be enough.

But a <script> tag is not the only dangerous thing in HTML. Consider, for example, the URL http://example.com/portal/image/user_male_portrait?img_id=755109&t=1372243875358" onclick="setTimeout(function () { document.body.innerHTML = '<img src="http://images.example.com:9191/public/rickroll.gif" style="display: block; width: 100%">'; }, 100);". It contains no <script> tag at all.
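To make the point concrete, here is a minimal sketch of the naive check. The function name naive_has_script_tag is hypothetical, and the second URL is a shortened stand-in for the onclick example above, not the original.

```python
def naive_has_script_tag(url):
    """Naive check: flag a URL only if it contains a <script tag."""
    return "<script" in url.lower()


# Shortened variants of the URLs discussed above (assumed for illustration).
with_script_tag = 'http://example.com/portal/image/user_male_portrait?img_id=755109"><script>alert(1)</script>'
with_event_handler = 'http://example.com/portal/image/user_male_portrait?img_id=755109" onclick="alert(1)'

print(naive_has_script_tag(with_script_tag))     # flagged
print(naive_has_script_tag(with_event_handler))  # missed, although it is just as dangerous
```

The second URL slips through because the payload lives in an attribute rather than in a tag, which is why matching on specific tags does not scale.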


All in all, the only thing you can really do with a regex is reject any URL that matches

(?i)^(?!\s*https?://)|[<>"']

That is, reject any URL that contains a bare <, >, " or ', and also reject any URL that does not start with something matching the regex https?:// (optionally preceded by whitespace). The second check is needed because, even with the first one in place, I could still submit

javascript:alert(Object.keys({gotcha:42}))
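The rejection rule above can be sketched in Python roughly as follows. The names SUSPICIOUS_URL and is_suspicious are made up for this example; the pattern itself is the one given in the answer.

```python
import re

# Pattern from the answer: matches when the URL does not start with
# http:// or https:// (leading whitespace allowed), or when it contains
# a bare <, >, " or ' anywhere.
SUSPICIOUS_URL = re.compile(r"""(?i)^(?!\s*https?://)|[<>"']""")


def is_suspicious(url):
    """Return True if the URL should be rejected under the rule above."""
    return SUSPICIOUS_URL.search(url) is not None


for candidate in (
    'http://example.com/portal/image/user_male_portrait?img_id=566203&t=1350313911834',
    'http://example.com/portal/image/user_male_portrait?img_id=755109"><script>alert(1)</script>',
    'javascript:alert(Object.keys({gotcha:42}))',
):
    print(is_suspicious(candidate), candidate)
```

Note that search is used rather than match, so the character class can fire anywhere in the string while the scheme check stays anchored at the start.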

However, if this is a case of input sanitization, note that <, >, " and ' can always be percent-encoded in any URL without damage, so maybe:

url.replace('<', '%3C').replace('>', '%3E')\
   .replace('"', '%22').replace("'", '%27')

is a more sensible thing to do (along with checking that the scheme is indeed either "http" or "https"). Alternatively, use urllib.parse.urlparse to split the URL into components, decode and re-encode each of them, and finally use urllib.parse.urlunparse to assemble them into a URL again.
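A rough sketch of the split, decode, re-encode approach is shown below, assuming Python 3. The function name normalize_url, the choice of safe characters, and the decision to reject rather than repair a suspicious host are all assumptions of this example, not part of the original answer.

```python
from urllib.parse import (
    parse_qsl, quote, unquote, urlencode, urlparse, urlunparse,
)


def normalize_url(url):
    """Re-encode an http(s) URL so that <, >, " and ' never appear bare."""
    parts = urlparse(url.strip())
    # urlparse lowercases the scheme, so "HTTP" and "http" compare equal.
    if parts.scheme not in ("http", "https"):
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    # Assumption: a host containing these characters is rejected outright.
    if any(ch in parts.netloc for ch in "<>\"'"):
        raise ValueError("suspicious host: %r" % parts.netloc)
    # Decode, then re-encode each component.
    path = quote(unquote(parts.path), safe="/")
    params = quote(unquote(parts.params), safe=";=")
    query = urlencode(parse_qsl(parts.query, keep_blank_values=True))
    fragment = quote(unquote(parts.fragment), safe="")
    return urlunparse(
        (parts.scheme, parts.netloc, path, params, query, fragment)
    )


print(normalize_url('http://example.com/a?t=1"><script>alert(1)</script>'))
```

Be aware that this normalizes rather than preserves the URL: for instance, an escaped %2F in the path comes back as a plain /, and a query key without = gains one. If the exact original encoding matters, the simple replace chain above is the less invasive option.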
