Get only domain name from urls using urlsplit

I have a dataset with URLs in different forms (e.g. https://stackoverflow.com, https://www.stackoverflow.com, stackoverflow.com) and I need to extract only the domain name, like stackoverflow.

I used parse.urlsplit(url) from urllib, but it does not work well in my case.

How can I get only the domain name?

Edit:

My code:

from urllib import parse

def normalization(df):
    # Split every URL in the "httpx" column into its components
    df['after_urlsplit'] = df["httpx"].map(parse.urlsplit)
    return df

normalization(df_sample)

Output:

            httpx                       after_urlsplit
0   https://stackoverflow.com/       (https, stackoverflow.com, /, , )
1   https://www.stackoverflow.com/   (https, www.stackoverflow.com, /, , )
2   www.stackoverflow.com/           (, , www.stackoverflow.com/, , )
3   stackoverflow.com/               (, , stackoverflow.com/, , )

New answer, working for URLs and host names too

To handle inputs that have no protocol (e.g. example.com), where urlsplit leaves netloc empty, it is easier to use a regex:

import re

urls = ['www.stackoverflow.com',
        'stackoverflow.com',
        'https://stackoverflow.com',
        'https://www.stackoverflow.com/',
        'www.stackoverflow.com',
        'stackoverflow.com',
        'https://subdomain.stackoverflow.com/']

for url in urls:
    host_name = re.search("^(?:.*://)?(.*)$", url).group(1).split('.')[-2]
    print(host_name)

This prints stackoverflow in all cases.
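
Note that the regex above splits everything after the protocol on dots, so an input whose path contains a dot (for example stackoverflow.com/index.html) would yield com/index rather than stackoverflow. A minimal sketch of an alternative is to add the missing "//" and let urlsplit isolate the host. The function name domain_label and the _SCHEME pattern are hypothetical, introduced only for illustration, and the sketch assumes every host has at least two labels, as in the list above:

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "https://" (hypothetical helper pattern).
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")

def domain_label(url):
    """Return the second-to-last host label, e.g. 'stackoverflow'."""
    # urlsplit only fills in the host when '//' precedes it,
    # so add it for bare host names such as 'stackoverflow.com/'.
    if not _SCHEME.match(url):
        url = "//" + url
    host = urlsplit(url).hostname  # host without port or path
    # Assumes at least two labels; 'localhost' would raise IndexError here.
    return host.split(".")[-2]

for url in ['www.stackoverflow.com/', 'https://subdomain.stackoverflow.com/',
            'stackoverflow.com/index.html']:
    print(domain_label(url))
```

With the DataFrame from the question, this could presumably be applied as df["httpx"].map(domain_label). Multi-part suffixes such as example.co.uk are not handled by this sketch.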

Old answer, working for URLs only

You can take the netloc value returned by urlsplit and split it on dots to get the part of the domain you want:

from urllib.parse import urlsplit

m = urlsplit('http://subdomain.example.com/some/extra/things')

print(m.netloc.split('.')[-2])

This prints example.

(However, this fails on URLs such as http://localhost/some/path/to/file.txt, where the host has only one label and indexing with [-2] raises an IndexError.)
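
One way to tolerate such single-label hosts is to check the number of labels before indexing. This is only a sketch: the name host_label is hypothetical, and returning the host itself when there is no second-to-last label is an assumption about the desired behaviour:

```python
from urllib.parse import urlsplit

def host_label(url):
    """Like netloc.split('.')[-2], but tolerant of single-label hosts."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # With fewer than two labels there is no second-to-last one,
    # so fall back to the host itself.
    return labels[-2] if len(labels) >= 2 else host

print(host_label('http://subdomain.example.com/some/extra/things'))
print(host_label('http://localhost/some/path/to/file.txt'))
```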

You can use a regular expression (regex) for this.

import re

URL = "https://www.test.com"
# Raw string avoids invalid-escape warnings; the dot after "www" is escaped and hyphens are allowed
result = re.search(r"https?://(www\.)?([\w.-]+)", URL)
print(result.group(2))

# output: test.com

The best way to handle this kind of problem is to use a regex.
