
JavaScript to remove whatever is after the tld and before the whitespace

I have a bunch of functions that filter a page down to the domains attached to email addresses. It all works great except for one small thing: some of the domains come out like this:

EXAMPLE.COM
EXAMPLE.ORG.
EXAMPLE.ORG>.
EXAMPLE.COM"
EXAMPLE.COM".
EXAMPLE.COM).
EXAMPLE.COM(COMMENT)"
DEPT.EXAMPLE.COM
EXAMPLE.ORG
EXAMPLE.COM.

I want to figure out one last filter (regex or not) that will remove everything after the TLD. All of these items are in an array.

EDIT

The function I'm using:

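function filterByDomain(array) {
    // Regex literal: inside a string, "\." and "\b" lose their regex meaning.
    // No "g" flag, so test() does not carry lastIndex between entries.
    var regex = /([^.\n]+\.[a-z]{2,6}\b)/i;
    return array.filter(function(text){
        return regex.test(text);
    });
}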

You can probably use this regex to match your TLD for each case:

/^[^.\n]+\.[a-z]{2,63}$/gim


Your validation function can be:

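function filterByDomain(array) {
    // No "g" flag: with it, test() resumes from regex.lastIndex
    // and can wrongly reject the entry that follows a match.
    var regex = /^[^.\n]+\.[a-z]{2,63}$/im;
    return array.filter(function(text){
        return regex.test(text);
    });
}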

PS: Do read this Q & A to see that up to 63 characters are allowed in a TLD.
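
To make the behaviour of this pattern concrete, here is a minimal sketch run against a few of the sample values from the question. The variable names (tldPattern, samples, kept, sticky) are only illustrative. It suggests two things: the anchored pattern filters entries rather than trimming them, and a regex carrying the g flag keeps state between test() calls.

```javascript
// Same pattern as above, without the "g" flag so test() is stateless.
var tldPattern = /^[^.\n]+\.[a-z]{2,63}$/i;

// A few of the sample values from the question.
var samples = ["EXAMPLE.COM", "EXAMPLE.ORG.", 'EXAMPLE.COM"', "DEPT.EXAMPLE.COM"];

// Only entries that are exactly "label.tld" survive; entries with trailing
// characters and entries with a subdomain are dropped, not trimmed.
var kept = samples.filter(function (text) {
    return tldPattern.test(text);
});
console.log(kept); // ["EXAMPLE.COM"]

// With the "g" flag, test() resumes from lastIndex after a successful match.
var sticky = /^[^.\n]+\.[a-z]{2,63}$/gi;
var first = sticky.test("EXAMPLE.COM");  // true, lastIndex becomes 11
var second = sticky.test("EXAMPLE.ORG"); // false, the search starts at index 11
console.log(first, second);
```

If the goal is to trim the trailing characters rather than discard the entry, the pattern needs to capture the domain instead of anchoring on the end of the string.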

I'd match all leading [\w.] characters and omit the last dot, if any:

var match = url.match(/^[\w.]+/); // null when url starts with another character
var result = match ? match[0] : "";
if (result.slice(-1) == ".") result = result.slice(0, -1);

Note that \w should be replaced with something more precise:

  • _ is part of the \w set but should not appear in the domain
  • - is not part of \w but can appear in the domain, as long as it is not adjacent to . or -

To keep the regexp simple and the code readable, I'd do it this way:

  1. replace every _ in the url with # (both # and _ can only appear after the TLD)
  2. replace every - with _ ( _ is part of \w )
  3. after the regexp match, replace every _ in the result with - again

A URL like www.-example-.com would still pass; it can be detected by searching for [.-]{2,}
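
The substitution steps and the final check can be sketched as follows. The function name extractDomain and the choice to return null for rejected input are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: returns the leading domain of `url`, or null.
function extractDomain(url) {
    // Step 1: '_' becomes '#', which \w does not match, so it ends the match.
    // Step 2: '-' becomes '_', which \w does match, so hyphens are kept.
    var swapped = url.replace(/_/g, "#").replace(/-/g, "_");

    var match = swapped.match(/^[\w.]+/);
    if (!match) return null;

    // Step 3: turn the placeholder '_' back into '-'.
    var domain = match[0].replace(/_/g, "-");

    // Omit the last dot, if any.
    if (domain.slice(-1) === ".") domain = domain.slice(0, -1);

    // Reject '-' or '.' next to another '-' or '.', e.g. "www.-example-.com".
    // This follows the rule stated above, so it also rejects labels that
    // contain "--", such as punycode labels starting with "xn--".
    if (domain === "" || /[.-]{2,}/.test(domain)) return null;

    return domain;
}

console.log(extractDomain("EXAMPLE.ORG>."));            // "EXAMPLE.ORG"
console.log(extractDomain('EXAMPLE.COM(COMMENT)"'));    // "EXAMPLE.COM"
console.log(extractDomain("my-site.example.com)."));    // "my-site.example.com"
console.log(extractDomain("www.-example-.com"));        // null
```

Applied to the array from the question, this could be used with array.map(extractDomain) followed by a filter that drops the null entries.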


 