Extract the domain name from a URL using Python's regular expressions

Question

I want to input a URL and extract the domain name, that is, the string that comes after http:// or https:// and consists of letters, digits, dots, underscores, or dashes.

I wrote the regex and used Python's re module as follows:

import re
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)

My understanding is that m.group(1) will extract the part matched between the parentheses in the pattern passed to re.search.

The output I expect is google.co.uk, but I am getting this:

<_sre.SRE_Match object; span=(0, 35), match='https://google.co.uk?link=something'>

Can you show me how to use re to achieve this?

Answer 1

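Calling m.group(1) on its own line discards the result, and print(m) prints the match object itself. You need to print the group: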

print(m.group(1))

Better yet, check that the match succeeded first:

import re

m = re.search(r'https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:  # re.search returns None when nothing matches
    print(m.group(1))
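
To see why the original print(m) showed a match object rather than the domain, it may help to compare the whole match with the capture group. The following is a minimal sketch, not a definitive implementation: the helper name extract_domain and the constant DOMAIN_RE are hypothetical, the pattern is the one from the question, and the trailing .* is dropped because it does not affect group 1.

```python
import re

# Pattern from the question; the parentheses form capture group 1.
DOMAIN_RE = re.compile(r'https?://([A-Za-z_0-9.-]+)')

def extract_domain(url):
    """Return the captured domain, or None if the URL does not match."""
    m = DOMAIN_RE.search(url)
    return m.group(1) if m else None

m = DOMAIN_RE.search('https://google.co.uk?link=something')
print(m.group(0))  # whole match: https://google.co.uk
print(m.group(1))  # capture group only: google.co.uk
print(extract_domain('not a url'))  # None
```

Returning None for a non-matching input keeps the "if m" check from the answer in one place, so callers do not hit an AttributeError on a failed match.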

Answer 2

Jan has already provided a solution for this. Note, though, that the same can be done without re. All it needs is the set of punctuation characters, !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, to detect where the domain ends. That set is available as string.punctuation in the string module.

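import string

def domain_finder(link):
    # Assumes the host name has exactly three dot-separated labels (e.g. google.co.uk).
    dot_splitter = link.split('.')

    # Skip past the scheme ("https://") in the first piece, if present.
    separator_first = 0
    if '//' in dot_splitter[0]:
        separator_first = dot_splitter[0].find('//') + 2

    # The first punctuation character in the third piece marks the end of the domain.
    separator_end = ''
    for ch in dot_splitter[2]:
        if ch in string.punctuation:
            separator_end = ch
            break

    if separator_end:
        end_ = dot_splitter[2].split(separator_end)[0]
    else:
        end_ = dot_splitter[2]

    domain = [dot_splitter[0][separator_first:], dot_splitter[1], end_]
    domain = '.'.join(domain)

    return domain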

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain) # prints ==> 'google.co.uk'

This is just another way of solving the same problem without re.
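
If avoiding re is the goal, the standard library's urllib.parse module is another option, and it does not depend on how many dots the host contains. This is a different technique from the string splitting above, sketched here under the assumption that the input includes a scheme such as https://. The function name host_from_url is hypothetical.

```python
from urllib.parse import urlsplit

def host_from_url(url):
    """Return the host part of a URL, or None if there is none."""
    # urlsplit only recognises a host when the URL contains '//'.
    return urlsplit(url).hostname

print(host_from_url('https://google.co.uk?link=something'))
print(host_from_url('https://test.pythonhosted.org/Flask-Mail/'))
```

Note that hostname is returned in lower case and without any port number, which may or may not be what a given use case wants.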

Answer 3

There is a library called tldextract that handles this case reliably.

Here is how it works:

import tldextract

def extractDomain(url):
    if "http" in str(url) or "www" in str(url):
        parsed = tldextract.extract(url)
        # Join the non-empty parts; named fields avoid relying on the result being iterable.
        parts = (parsed.subdomain, parsed.domain, parsed.suffix)
        return ".".join(part for part in parts if part)
    else:
        return "NA"

# To process a file of URLs, one per line:
# with open("test.txt") as ptr, open("out.txt", "w") as op:
#     for line in ptr.read().split("\n"):
#         op.write(str(extractDomain(line)) + "\n")

print(extractDomain("https://test.pythonhosted.org/Flask-Mail/"))

The output is:

test.pythonhosted.org

Answer 1
6 · accepted · 2019-04-26 06:35:25

Answer 2
1 · 2020-07-01 07:27:36

Answer 3
0 · 2019-04-26 08:36:44
