Get only domain name from urls using urlsplit

Question

I have a dataset with URLs in different forms (e.g. https://stackoverflow.com, https://www.stackoverflow.com, stackoverflow.com) and I need only the domain name, like stackoverflow.

I used parse.urlsplit(url) from urllib, but it does not work well in my case.

How can I get only the domain name?

Edit:

My code:

from urllib import parse

def normalization(df):
    # Store the full urlsplit result for each URL in a new column
    df['after_urlsplit'] = df['httpx'].map(lambda x: parse.urlsplit(x))
    return df

normalization(df_sample)

Output:

            httpx                       after_urlsplit
0   https://stackoverflow.com/       (https, stackoverflow.com, /, , )
1   https://www.stackoverflow.com/   (https, www.stackoverflow.com, /, , )
2   www.stackoverflow.com/           (, , www.stackoverflow.com/, , )
3   stackoverflow.com/               (, , stackoverflow.com/, , )

Answer 1

New answer, working for URLs and host names too

To handle inputs that have no scheme (e.g. example.com), a regex works better than urlsplit:

import re

urls = ['www.stackoverflow.com',
        'stackoverflow.com',
        'https://stackoverflow.com',
        'https://www.stackoverflow.com/',
        'www.stackoverflow.com',
        'stackoverflow.com',
        'https://subdomain.stackoverflow.com/']

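for url in urls:
    # Skip an optional scheme, keep the host up to the first "/",
    # then take the label just before the last dot
    host_name = re.search(r"^(?:\w+://)?([^/]*)", url).group(1).split('.')[-2]
    print(host_name)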

This prints stackoverflow in all cases.
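Note that taking the second-to-last label assumes a single-label suffix such as .com; for a host like example.co.uk it would return co. Below is a minimal sketch of one possible workaround, assuming a hand-maintained set of multi-part suffixes. The names MULTI_PART_SUFFIXES and registered_label are hypothetical, and the set shown is only illustrative, not complete.

```python
import re

# Illustrative only (assumption): a real list of public suffixes is much longer.
MULTI_PART_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registered_label(url):
    """Return the label just before the suffix (hypothetical helper)."""
    # Skip an optional scheme, then keep everything up to the first "/".
    host = re.search(r"^(?:\w+://)?([^/]*)", url).group(1).lower()
    labels = host.split(".")
    # If the last two labels form a known multi-part suffix, step back one more.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_PART_SUFFIXES:
        return labels[-3]
    # Hosts without a dot are returned unchanged.
    return labels[-2] if len(labels) >= 2 else host


print(registered_label("https://www.example.co.uk/"))
print(registered_label("stackoverflow.com"))
```

Any suffix missing from the set falls back to the plain second-to-last label, so the result is only as good as the list you supply.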

Old answer, working for URLs only

You can take the netloc value returned by urlsplit and split it on dots to get the part of the domain you want:

from urllib.parse import urlsplit

m = urlsplit('http://subdomain.example.com/some/extra/things')

print(m.netloc.split('.')[-2])

This prints example.

(However, this fails on URLs such as http://localhost/some/path/to/file.txt, where the host contains no dot and the [-2] index raises IndexError.)
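If you would rather stay with urlsplit, here is a minimal sketch of one way to cover both gaps: prepend "//" when no scheme is present so that urlsplit fills in the network location, and fall back to the whole host when it has no dot. The helper name domain_label is hypothetical, and the "://" check is a simplifying assumption that the string does not contain "://" anywhere else.

```python
from urllib.parse import urlsplit


def domain_label(url):
    """Return the label before the last dot of the host (hypothetical helper)."""
    # Without a scheme, urlsplit puts the host in .path; a leading "//"
    # marks what follows as the network location instead.
    if "://" not in url:
        url = "//" + url
    # .hostname is the lowercased host without any port; it is None if absent.
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Single-label hosts such as "localhost" have no second-to-last label.
    return labels[-2] if len(labels) >= 2 else host


for url in ["https://stackoverflow.com/",
            "www.stackoverflow.com/",
            "stackoverflow.com/",
            "http://localhost/some/path/to/file.txt"]:
    print(domain_label(url))
```

With pandas, the same function could be passed directly to df["httpx"].map in place of the lambda from the question.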

Answer 2

You can use a regular expression (regex) for this.

import re

URL = "https://www.test.com"
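# Raw string avoids invalid-escape warnings; the dot after "www" is escaped
# and hyphens are allowed in the host
result = re.search(r"https?://(www\.)?([\w.-]+)", URL)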
print(result.group(2))

# output: test.com
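Because this pattern requires an http:// or https:// prefix, re.search returns None for inputs such as www.test.com, and calling .group on that result raises AttributeError. Below is a hedged sketch of a variant that makes the scheme optional and checks for a failed match. The names HOST_RE and host_without_www are hypothetical.

```python
import re

# Scheme and "www." are both optional; the host may contain letters,
# digits, underscores, dots and hyphens.
HOST_RE = re.compile(r"^(?:https?://)?(?:www\.)?([\w.-]+)")


def host_without_www(url):
    """Return the host minus a leading 'www.', or None if nothing matches."""
    match = HOST_RE.search(url)
    return match.group(1) if match else None


print(host_without_www("https://www.test.com"))
print(host_without_www("www.test.com/path"))
print(host_without_www("://broken"))
```

Like the original pattern, this returns the host with its suffix (test.com), so one more split on "." is still needed to get just the name.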

Answer 3

The best way to handle this kind of problem is to use a regex.

3 answers

solution1
1 ACCEPTED 2019-12-28 21:59:40

solution2
1 2019-12-28 22:05:42

solution3
0 2019-12-28 22:04:20
