简体   繁体   中英

Url parsing in javascript and DOM

I am writing a support chat application where I want text to be parsed for urls. I have found answers for similar questions but nothing for the following.

what i have

function ReplaceUrlToAnchors(text) {
    var exp = /(\b(https?:\/\/|ftp:\/\/|file:\/\/|www.)
              [-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
    return text.replace(exp,"<a href='$1' target='_blank'>$1</a>"); 
}

that pattern is a modified version of one i found on the internet. It includes www. in the first token, because not all urls start with protocol:// However, when www.google.com is replaced with

<a href='www.google.com' target='_blank'>www.google.com</a>

which pulls up MySite.com/webchat/wwww.google.com and I get a 404

that is my first problem, my second is...

in my script for generating messages to the log, I am forced to do it a hacky way:

var last = 0;
function UpdateChatWindow(msgArray) {

    var chat = $get("MessageLog");
    for (var i = 0; i < msgArray.length; i++) {
        var element = document.createElement("div");
        var linkified = ReplaceUrlToAnchors(msgArray[i]);
        element.setAttribute("id", last.toString());
        element.innerHTML = linkified;
        chat.appendChild(element);
        last = last + 1;
    }
}

To get the "linkified" string to render HTML out correctly I have to use the non-standard.innerHTML attribute of element. I would prefer a way were i could parse the string as tokens - text tokens and anchor tokens - and call either createTextNode or createElement("a") and stitch them together with DOM.

so question 1 is how should I go about www.site.com parsing, or even site.com? and question 2 is how would could I do this using only DOM?

Another thing you could do is this:

function ReplaceUrlToAnchors(text) {
    var exp = /(\b(https?:\/\/|ftp:\/\/|file:\/\/|www.)
              [-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
    return text.replace(exp, function(_, url) {
      return '<a href="' +
        (/^www\./.test(url) ? "http://" + url : url) +
        'target="_blank">' +
        url +
        '</a>';
    }); 
}

That is kind-of like your solution, but it does the check for "www" URLs in that callback passed in to ".replace()".

Note that you won't be picking up "stackoverflow.com" or "newegg.com" or anything like that, which I understand may be unavoidable (and even desirable, given the false positives you'd pick up).

Here is what I came up with, perhaps someone has something better?

function replaceUrlToAnchors(text) {
    var naked = /(\b(www.)[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|](.com|.net|.org|.co.uk|.ca|.))/ig;
    text = text.replace(naked, "http://$1");

    var exp = /(\b(https?:\/\/|ftp:\/\/|file:\/\/)([-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]))/ig;
    return text.replace(exp,"<a href='$1' target='_blank'>$3</a>"); 
}

the first regex will replace www.google.com with http://www.google.com and is good enough for what I am doing. However, I will hold off marking this as the answer because I would also like to make (www.) optional but when I do (www.)? it replaces every word with http://word/

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM