What is the right way to safely and accurately insert user-provided URL data into an HTML5 document?

Question

Given arbitrary customer input for a URL in a web form, I want to generate a new HTML document containing that URL within an href. My question is how I am supposed to protect that URL within my HTML.

What should be rendered into the HTML for the following URLs that are entered by an unknown end user:

http://example.com/?file=some_19%affordable.txt
http://example.com/url?source=web&last="f o o"&bar=<
https://www.google.com/url?source=web&sqi=2&url=https%3A%2F%2Ftwitter.com%2F%3Flang%3Den&last=%22foo%22

If we assume that the URLs are already URI-encoded, which I think is reasonable if users are copying them from a URL bar, then simply passing the URL to attr() produces a valid URL and a document that passes the Nu HTML checker at validator.w3.org/nu.

To see it in action, we set up a JS fiddle at https://jsfiddle.net/kamelkev/w8ygpcsz/2/. Replacing the URLs there with the examples above shows what is happening.

For future reference, this consists of an HTML snippet

<a>My Link</a>

and this JS:

$(document).ready(function() {
 $('a').attr('href', 'http://example.com/request.html?data=&gt;');
 $('a').attr('href2', 'http://example.com/request.html?data=<');
 alert($('a').get(0).outerHTML);
});

With URL 1, it is not possible to tell mechanically whether it is URI-encoded or not. You can surmise from human knowledge that it is not, and that it refers to a file named some_19%affordable.txt. When run through the fiddle, it produces

<a href="http://example.com/?file=some_19%affordable.txt">My Link</a>

This passes the HTML5 validator without complaint, but it is likely not what the user intended.

The second URL is clearly not URI-encoded. The question becomes what is the right thing to put into the HTML to prevent HTML parsing problems.

Running it through the fiddle, Safari 10 produces this:

<a href="http://example.com/url?source=web&amp;last=&quot;f o o&quot;&amp;bar=&lt;">My Link</a>

and pretty much every other browser produces this:

<a href="http://example.com/url?source=web&amp;last=&quot;f o o&quot;&amp;bar=<">My Link</a>

Neither of these passes the validator. Three complaints are possible: the literal double quote (after un-escaping the HTML), the spaces, or the trailing < character (also after un-escaping the HTML). The validator only reports the first of these it finds. This is clearly not valid HTML.

There are two ways to try to fix this. The first is to HTML-escape the URL before giving it to attr(). However, this turns every & into &amp;amp;, because entities such as &amp; and &lt; are escaped a second time by attr(), and the URL in the document is entirely inaccurate. It looks like this:

<a href="http://example.com/url?source=web&amp;amp;last=&amp;quot;f+o+o&amp;quot;&amp;amp;bar=&amp;lt;">My Link</a>

The other is to URI-encode it before passing it to attr(), which does result in a properly validating URL that actually links to the intended destination. It looks like this:

<a href="http://example.com/url?source=web&amp;last=%22f%20o%20o%22&amp;bar=%3C">My Link</a>
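The URI-encoding step can be reproduced without the DOM. This is a minimal sketch, assuming example URL 2 as input; the variable names are illustrative. The &amp; in the markup above comes from HTML serialization, not from encodeURI() itself.

```javascript
// Example URL 2 as typed by the user (not URI-encoded).
const rawUrl = 'http://example.com/url?source=web&last="f o o"&bar=<';

// encodeURI() escapes characters that are not legal in a URI (quote, space, <)
// but leaves reserved delimiters such as ?, & and = alone.
const encodedUrl = encodeURI(rawUrl);

console.log(encodedUrl);
// http://example.com/url?source=web&last=%22f%20o%20o%22&bar=%3C

// Caveat: encodeURI() also escapes "%", so input that is already
// encoded gets double-encoded.
const doubleEncoded = encodeURI('%22foo%22');
console.log(doubleEncoded);
// %2522foo%2522
```

The caveat at the end is why this fix cannot be applied blindly to URLs such as example 3.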

Finally, for the third URL, which is properly URI encoded, the proper HTML that validates does come out.

<a href="https://www.google.com/url?source=web&amp;sqi=2&amp;url=https%3A%2F%2Ftwitter.com%2F%3Flang%3Den&amp;last=%22foo%22">My Link</a>

and it does what the user would expect to happen when clicked.

Based on this, the algorithm should be:

if url is encoded then
 pass as-is to attr()
else
 pass encodeURI(url) to attr()

However, the "is encoded" test seems to be impossible to answer in the affirmative, based on these two prior discussions (indeed, see example URL 1):

How to find out if string has already been URL encoded?
How to know if a URL is decoded/encoded?
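One possible way around the "is encoded" test is to make the encoding step idempotent, so that running it on an already-encoded URL changes nothing. The following is only a sketch under a stated assumption: any % followed by two hex digits is treated as an existing escape and left alone. encodeOnce is a hypothetical helper name, not a library function.

```javascript
// Hypothetical helper: encode unsafe characters, but keep existing %XX escapes.
function encodeOnce(url) {
  // encodeURI() turns every "%" into "%25"; undo that where the "%"
  // was already the start of a two-hex-digit escape sequence.
  return encodeURI(url).replace(/%25([0-9A-Fa-f]{2})/g, '%$1');
}

const url1 = 'http://example.com/?file=some_19%affordable.txt';
const url2 = 'http://example.com/url?source=web&last="f o o"&bar=<';
const url3 = 'https://www.google.com/url?source=web&sqi=2&url=https%3A%2F%2Ftwitter.com%2F%3Flang%3Den&last=%22foo%22';

console.log(encodeOnce(url1)); // unchanged: "%af" looks like an escape
console.log(encodeOnce(url2)); // quote, spaces and < are escaped
console.log(encodeOnce(url3)); // unchanged: already encoded
```

This does not resolve the ambiguity in example URL 1: the %af is kept as if it were an escape, which is probably not what that user meant.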

If we bypass the attr() method and forcibly insert the HTML-escaped version of example URL 2 into the document structure, it would look like this:

<a href="http://example.com/url?source=web&amp;last=&quot;f+o+o&quot;&amp;bar=&lt;">My Link</a>

This looks like valid HTML, yet it fails the HTML5 validator because the attribute value un-escapes to a URL containing invalid characters. The browsers, however, don't seem to mind it. Unfortunately, if you do any other manipulation of the object, the browser will re-escape all the &'s anyway.
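When the markup is assembled as a string rather than through attr(), the attribute value has to be HTML-escaped exactly once, and the URL inside it still has to be valid on its own. A minimal sketch, assuming the URL has already been URI-encoded; escapeAttr is a hypothetical helper, not part of jQuery or the DOM:

```javascript
// Hypothetical helper: escape a string for use inside a double-quoted attribute.
function escapeAttr(value) {
  return value
    .replace(/&/g, '&amp;')   // must run first so later entities are not re-escaped
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Example URL 2 after URI-encoding.
const href = 'http://example.com/url?source=web&last=%22f%20o%20o%22&bar=%3C';
const html = '<a href="' + escapeAttr(href) + '">My Link</a>';

console.log(html);
// <a href="http://example.com/url?source=web&amp;last=%22f%20o%20o%22&amp;bar=%3C">My Link</a>
```

The two layers are independent: URI-encoding makes the URL valid, and HTML-escaping makes the attribute parse correctly.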

As you can see, this is all very confusing. This is the first time we're using the browser itself to generate the HTML, and we are not sure if we are getting it right. Previously, we did it server side using templates, and only did the HTML-escape filter.

What is the right way to safely and accurately insert user-provided URL data into an HTML5 document (using JavaScript)?

Answer 1

If you can assume the URL is either fully encoded or not encoded at all, you may be able to get away with something along the following lines. Try to decode the URL and treat an error as a sign that the URL was not encoded; either way, you should be left with a decoded URL.

<script>
var inputurl = 'http://example.com/?file=some_19%affordable.txt';
var myurl;

try {
    myurl = decodeURI(inputurl);
}
catch(error) {
    myurl = inputurl;
}

console.log(myurl);
</script>
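To see how this behaves on different inputs, the same try/catch can be wrapped in a function. This is a sketch only; decodeIfEncoded is a hypothetical name and the sample inputs are illustrative.

```javascript
// Hypothetical wrapper around the try/catch approach above.
function decodeIfEncoded(inputUrl) {
  try {
    return decodeURI(inputUrl);
  } catch (error) {
    // decodeURI() throws URIError on malformed escapes such as "%af".
    return inputUrl;
  }
}

console.log(decodeIfEncoded('http://example.com/?file=some_19%affordable.txt'));
// http://example.com/?file=some_19%affordable.txt  (decodeURI threw, input kept)

console.log(decodeIfEncoded('http://example.com/?last=%22foo%22'));
// http://example.com/?last="foo"

console.log(decodeIfEncoded('https%3A%2F%2Ftwitter.com'));
// https%3A%2F%2Ftwitter.com  (escapes of reserved characters are not decoded)
```

Note that decodeURI() leaves escapes for reserved characters such as : and / in place, so the result is not always fully decoded, and passing it to encodeURI() afterwards would turn those remaining % signs into %25.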
