How do I detect malformed URLs, or URLs with javascript injected in them
'http://example.com/portal/image/user_male_portrait?img_id=755109&t=1372243875358"><script>setTimeout(function () {document.body.innerHTML = \'<img src="http://images.example.com:9191/public/rickroll.gif" style="display: block; width: 100%">\'; }, 100);</script><!--'
'http://example.com/portal/image/user_male_portrait?img_id=566203&t=1350313911834'
The first URL is malicious while the second one is not. I want to be able to flag the first one. I suppose I could use a regex to match script tags, but is that the right way to do it in Python?
It would be really hard to write a regular expression that can tell whether a URL is an attempt at script injection. To match the example you gave, searching for <script would be enough.
But a <script> tag is not the only dangerous thing in HTML. Consider, for example, this URL:
http://example.com/portal/image/user_male_portrait?img_id=755109&t=1372243875358" onclick="setTimeout(function () { document.body.innerHTML = '<img src="http://images.example.com:9191/public/rickroll.gif" style="display: block; width: 100%">'; }, 100);"
There is no <script> tag in it at all.
All in all, the only thing you can really do with a regex is to reject any URL that matches
(?i)^(?!\s*https?://)|[<>"']
That is, reject any URL that contains a bare <, >, " or ', and reject any URL that starts with anything other than http:// or https://. The second check is needed because, even with the first one in place, I could still do
javascript:alert(Object.keys({gotcha:42}))
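As a minimal sketch, the check above could be wired up in Python like this. The names SUSPICIOUS and looks_suspicious are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Matches when the URL does not start with http:// or https://
# (ignoring case and leading whitespace), or when it contains
# a bare <, >, " or ' anywhere.
SUSPICIOUS = re.compile(r"""(?i)^(?!\s*https?://)|[<>"']""")


def looks_suspicious(url):
    """Return True if the URL should be rejected (hypothetical helper)."""
    return SUSPICIOUS.search(url) is not None


if __name__ == "__main__":
    print(looks_suspicious("http://example.com/portal/image/user_male_portrait?img_id=566203&t=1350313911834"))
    print(looks_suspicious("javascript:alert(Object.keys({gotcha:42}))"))
```

Note that search is used rather than match, because the second alternative has to be able to find a forbidden character anywhere in the string.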
However, if this is a case of input sanitization, note that <, >, " and ' can always be percent-encoded in any URL without damage, so maybe:
url.replace('<', '%3C').replace('>', '%3E')\
.replace('"', '%22').replace("'", '%27')
is a more sensible thing to do (along with checking that the scheme is indeed either "http:" or "https:"). Alternatively, use urllib.parse.urlparse to split the URL into components, decode and re-encode each one, and finally use urllib.parse.urlunparse to turn it back into a URL.
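The split, re-encode and reassemble approach might look roughly like the sketch below. The function name sanitize_url, the ALLOWED_SCHEMES set and the choice of "safe" characters per component are assumptions made for illustration, not something the standard library prescribes.

```python
from urllib.parse import quote, unquote, urlparse, urlunparse

# Assumption: only plain web URLs are acceptable.
ALLOWED_SCHEMES = {"http", "https"}


def sanitize_url(url):
    """Re-encode a URL component by component (hypothetical helper)."""
    parts = urlparse(url.strip())
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    # The host part is not re-encoded here, so simply refuse odd characters.
    if any(c in parts.netloc for c in "<>\"'"):
        raise ValueError("suspicious characters in host")
    # Decode first so already-encoded input is not double-encoded,
    # then encode everything except the listed separator characters.
    path = quote(unquote(parts.path), safe="/")
    params = quote(unquote(parts.params), safe="=")
    query = quote(unquote(parts.query), safe="=&")
    fragment = quote(unquote(parts.fragment), safe="")
    return urlunparse((parts.scheme, parts.netloc, path, params, query, fragment))


if __name__ == "__main__":
    print(sanitize_url('http://example.com/x?t=1"><script>alert(1)</script>'))
```

One caveat with this sketch: decoding the whole query string before re-encoding it loses the distinction between a literal & and an encoded %26 inside a value, so if that matters, the query would have to be split into key/value pairs first.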