Remove remote content links in HTML using javascript

Question

I have to scan an HTML document for remote content (iframe, img, script tags, etc.) and remove the links in them based on a blacklist. I am already able to remove iframe, img, and script tags whose src points to a blacklisted URL.

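var block = p[key];   // blacklist pattern for this entry (p and key are defined elsewhere)
var re = new RegExp(block);
var frames = document.getElementsByTagName('iframe');
// Iterate backwards: the collection is live, so replacing a node shrinks it.
for (var i = frames.length - 1; i >= 0; i--)
{
    var frame = frames[i];
    if (re.test(frame.src))
    {
        // Use a fresh placeholder each time; a single shared span would just be moved.
        var mySpan = document.createElement("span");
        frame.parentNode.replaceChild(mySpan, frame);
        // Alternative: keep the element and only blank its link with frame.src = '';
    }
}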

I do the same for script and img tags, but there can be many more such tags. Is there a generic solution that traverses all tags in the HTML and finds/replaces links that are blacklisted? I am very new to JavaScript, so I am a bit weak in its basics. Can this solution work in my case? I don't want to use jQuery or similar libraries, as I am doing this on Android.

Answer 1

Get all elements in the document with document.getElementsByTagName('*').

Once you have them, use whatever code you find suitable to check each element against your condition.
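A minimal sketch of that idea, under a few assumptions: the function name stripBlacklistedSrc is made up for illustration, the blacklist is assumed to be an array of RegExp objects, and only the src attribute is checked. It takes the root to scan as a parameter, so it can be handed document or any single element.

```javascript
// Hypothetical helper (not from the question): remove the src attribute of
// every element under `root` whose src matches a blacklist pattern.
// `blacklist` is assumed to be an array of RegExp objects.
function stripBlacklistedSrc(root, blacklist) {
  // Copy the live collection into a plain array so the loop is not
  // affected by DOM changes made while iterating.
  var all = Array.prototype.slice.call(root.getElementsByTagName('*'));
  var removed = 0;
  for (var i = 0; i < all.length; i++) {
    var el = all[i];
    if (!el.hasAttribute('src')) {
      continue; // element carries no src, nothing to check
    }
    var url = el.getAttribute('src');
    for (var j = 0; j < blacklist.length; j++) {
      // search() ignores the regexp's lastIndex, so patterns created
      // with the g flag behave the same on every call.
      if (url.search(blacklist[j]) !== -1) {
        el.removeAttribute('src');
        removed++;
        break;
      }
    }
  }
  return removed; // number of src attributes that were removed
}
```

In a page this could be called as stripBlacklistedSrc(document, [/ads\.example\.com/]), where the pattern is only a placeholder. Removing the attribute instead of the whole element keeps the page layout intact; swap in replaceChild if the element itself has to go.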

This makes sure that you have checked everything. If you were using jQuery, I could make things simpler.

But much respect for being a pure JavaScripter!

Answer 2

Don't run regular expressions over the HTML itself; use the DOM.

Review the HTML standard for the list of attributes on tags that can contain external links.
Loop over the collections returned from document.getElementsByTagName(tagname).
Check each attribute against the blacklist and clean up with .getAttribute and .removeAttribute (bonus: you will have normalized data, so no need to worry about people trying to sneak by with funky escaping!).
Many of those attributes will be called src, so you might want to loop over tag name "*" and check this attribute, just to be a little future-proof/paranoid. Or just loop over all attributes on all elements. That will be very slow, though, and it still doesn't guarantee that somebody won't get around it by using URLs that are hard to distinguish from plain text (like an IP address or a domain name without a protocol), so I recommend against a full scan.
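The steps above could be sketched roughly like this. The tag/attribute table is an assumption and is deliberately not exhaustive (check the HTML standard for the full list), and the names URL_ATTRS, matchesBlacklist, and cleanRemoteLinks are made up for the example. The blacklist is again assumed to be an array of RegExp objects.

```javascript
// Assumed, non-exhaustive table: tag name -> attributes that can load
// remote content. Extend it after checking the HTML standard.
var URL_ATTRS = {
  iframe: ['src'],
  img: ['src'],
  script: ['src'],
  link: ['href'],
  embed: ['src'],
  object: ['data'],
  video: ['src', 'poster'],
  audio: ['src'],
  source: ['src']
};

// Hypothetical helper: true if `url` matches any pattern in `blacklist`.
function matchesBlacklist(url, blacklist) {
  for (var i = 0; i < blacklist.length; i++) {
    if (url.search(blacklist[i]) !== -1) {
      return true;
    }
  }
  return false;
}

// Hypothetical helper: walk the table, check each listed attribute on each
// matching element, and remove the attribute when its value is blacklisted.
function cleanRemoteLinks(root, blacklist) {
  var cleaned = 0;
  for (var tag in URL_ATTRS) {
    if (!Object.prototype.hasOwnProperty.call(URL_ATTRS, tag)) {
      continue;
    }
    // Snapshot the live collection before touching the elements.
    var els = Array.prototype.slice.call(root.getElementsByTagName(tag));
    var attrs = URL_ATTRS[tag];
    for (var i = 0; i < els.length; i++) {
      for (var j = 0; j < attrs.length; j++) {
        var url = els[i].getAttribute(attrs[j]); // null when absent
        if (url !== null && matchesBlacklist(url, blacklist)) {
          els[i].removeAttribute(attrs[j]);
          cleaned++;
        }
      }
    }
  }
  return cleaned; // number of attributes removed
}
```

Calling cleanRemoteLinks(document, patterns) once the page has loaded would cover every tag in the table in one pass. Supporting a new tag then only means adding a line to URL_ATTRS, which is cheaper than scanning every attribute on every element.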
