Match domain name from url (www.google.com=google)

So I want to match just the domain from any of these:

http://www.google.com/test/
http://google.com/test/
http://google.net/test/

google

I got this code working for just .com:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.com.*$/\1/p"
Output: 'google'

Then I thought it would be as simple as doing say (com|net) but that doesn't seem to be true:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.(com|net).*$/\1/p"
Output: '' (nothing)

I was going to use a similar method to get rid of the "www", but it seems I'm doing something wrong. (Does alternation not work outside the \( \) group?)

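If you have Python, you can use the urlparse module (urllib.parse in Python 3):

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

with open("file") as urls:
    for line in urls:
        # netloc is the host part of the URL, e.g. "www.google.com"
        d = urlparse(line.strip()).netloc.split(".")
        # Skip a leading "www" label if present
        if d[0] == "www":
            print(d[1])
        else:
            print(d[0])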

Output:

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/

$ ./python.py
google
google
google
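A slightly more defensive variant of the urlparse approach, as a sketch only. The helper name domain_label is hypothetical, and it assumes the wanted part is always the first host label after an optional "www", which is the rule used above. It also accepts inputs without "http://" by prepending "//" so that the parser treats the start of the string as the host.

```python
from urllib.parse import urlparse


def domain_label(url):
    """Return the first host label after an optional leading "www"."""
    url = url.strip()
    # urlparse only fills in the host when a "//" authority marker is present,
    # so add one to bare inputs such as "google.com/test".
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Compare the whole label, so a host like "wwwexample.com" is not mistaken
    # for one with a "www" prefix.
    if labels[0] == "www" and len(labels) > 1:
        return labels[1]
    return labels[0]


if __name__ == "__main__":
    for u in ["http://www.google.com/test/", "http://google.net/test/",
              "www.google.com.cn/test", "google.com/test"]:
        print(domain_label(u))
```

Comparing labels[0] to "www" rather than searching the whole host for the substring avoids picking the wrong label for hosts that merely contain "www" somewhere.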

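Or you can use awk:

awk '{
    # Strip the scheme and everything from the first "/" after the host
    gsub(/http:\/\/|\/.*$/, "")
    split($0, d, ".")
    # Skip a leading "www" label if present
    if (d[1] == "www") {
        print d[2]
    } else {
        print d[1]
    }
}' file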

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
www.google.com.cn/test
google.com/test

$ ./shell.sh
google
google
google
google
google

This will output "google" in all cases:

sed -n "s|http://\(.*\.\)*\(.*\)\..*|\2|p"

Edit:

This version will handle URLs like "http://google.com.cn/test" and "http://www.google.co.uk/" as well as the ones in the original question:

sed -nr "s|http://(www\.)?([^.]*)\.(.*\.?)*|\2|p"

This version will handle cases that don't include "http://" (plus the others):

sed -nr "s|(http://)?(www\.)?([^.]*)\.(.*\.?)*|\3|p"

s|http://(www\.)?([^.]*)|$2|

This is Perl, using an alternate delimiter (because that makes it clearer). I'm sure you can port it to sed or whatever you need.
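As a sketch of one such port, here is the same idea in Python's re module. The names PATTERN and first_label are hypothetical, and accepting "https://" and a missing scheme are assumptions that go beyond the substitution above.

```python
import re

# Optional scheme, optional "www.", then capture everything up to the next dot.
# [^./]+ stops at a dot or a slash, so the capture cannot run into the path.
PATTERN = re.compile(r"^(?:https?://)?(?:www\.)?([^./]+)\.")


def first_label(url):
    """Return the first host label after an optional "www.", or None."""
    m = PATTERN.match(url.strip())
    return m.group(1) if m else None


if __name__ == "__main__":
    for u in ["http://www.google.co.uk/", "http://google.com.cn/test",
              "google.com/test"]:
        print(first_label(u))
```

Unlike the substitution, this extracts the capture group directly instead of rewriting the line, so nothing after the first label needs to be matched.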

Have you tried using the "-r" switch on your sed command? This enables the extended regular expression mode (egrep-compatible regexes).

Edit: try this, it seems to work. With "-r" the alternation can be written unescaped as (com|net). Note that sed's extended syntax has no Perl-style "?:" non-capturing groups, so a plain group is used here; it simply becomes \2 and is ignored.

echo "http://www.google.com/test/" | sed -nr "s/.*www\.(.*)\.(com|net).*$/\1/p"

#!/bin/bash

urls=(
  http://www.google.com/test/
  http://google.com/test/
  http://google.net/test/
)

for url in "${urls[@]}"; do
  echo "$url" | sed -re 's,^http://(.*\.)*(.+)\.[a-z]+/.+$,\2,'
done
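For comparison, a Python sketch of the same loop, using the com|net alternation the question asked about. Python's re module does support Perl-style "(?:...)" non-capturing groups, so the name stays in group 1. The variable names are illustrative, and the pattern assumes the URLs start with "http://" and end in .com or .net, as in the question.

```python
import re

urls = [
    "http://www.google.com/test/",
    "http://google.com/test/",
    "http://google.net/test/",
]

# (?:www\.)? and (?:com|net) are non-capturing, so the name is group 1.
pattern = re.compile(r"^http://(?:www\.)?(.+)\.(?:com|net)/")

# Keep only the URLs that match, and pull out the captured name.
names = [m.group(1) for m in map(pattern.match, urls) if m]

if __name__ == "__main__":
    for name in names:
        print(name)
```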
