Extract domain name from URL using Python's re regex

I want to input a URL and extract the domain name, which is the string that comes after http:// or https:// and contains letters, digits, dots, underscores, or dashes.

I wrote the regex and used Python's re module as follows:

import re
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)

My understanding is that m.group(1) will extract the part matched between the parentheses () in the pattern passed to re.search.

The output that I expect is google.co.uk, but I am getting this:

<_sre.SRE_Match object; span=(0, 35), match='https://google.co.uk?link=something'>

Can you point out how to use re to achieve this?

You need to print the group rather than the match object:

print(m.group(1))

Better yet, check that a match was found first, since re.search returns None when nothing matches:

m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:
    print(m.group(1))
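
To make the difference between the match object and its groups concrete, here is a minimal sketch that wraps the same pattern in a small helper. The names HOST_PATTERN and extract_host are hypothetical and only used for illustration; the pattern itself is the one from the question.

```python
import re

# Same pattern as in the question, compiled once. HOST_PATTERN and
# extract_host are illustrative names, not part of the re module.
HOST_PATTERN = re.compile(r'https?://([A-Za-z_0-9.-]+).*')

def extract_host(url):
    """Return the text captured by group 1, or None if the URL does not match."""
    m = HOST_PATTERN.search(url)
    if m is None:
        return None
    return m.group(1)

m = HOST_PATTERN.search('https://google.co.uk?link=something')
print(m.group(0))  # whole match: https://google.co.uk?link=something
print(m.group(1))  # first capture group: google.co.uk
print(extract_host('not a url'))  # None
```

Printing m itself shows the match object's repr, which is why the question's code displayed the whole matched URL instead of the captured group.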

Jan has already provided a solution for this. Just to note, we can implement the same thing without using re. All it needs is the set of punctuation characters, !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, for validation purposes, and that set is available as string.punctuation in the string module.

import string

def domain_finder(link):
    # Drop the scheme ("https://") if one is present.
    start = link.find('//')
    rest = link[start + 2:] if start != -1 else link

    # Dots, dashes and underscores may appear inside the domain;
    # any other punctuation character marks its end.
    terminators = set(string.punctuation) - set('.-_')

    # Cut at the first terminating character (e.g. '?', '/', ':').
    for index, char in enumerate(rest):
        if char in terminators:
            return rest[:index]
    return rest

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain) # prints ==> google.co.uk

This is just another way of solving the same problem without re.
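
A further option that also avoids re, not mentioned in the answers above, is the standard library's urllib.parse module. The sketch below assumes the input is an absolute URL with a scheme; host_from_url is a hypothetical helper name chosen for illustration.

```python
from urllib.parse import urlsplit

def host_from_url(url):
    """Return the host name of an absolute URL, or None if it has no network location."""
    # urlsplit separates scheme, netloc, path, query and fragment;
    # .hostname is the netloc lowercased, without userinfo or port.
    return urlsplit(url).hostname

print(host_from_url('https://google.co.uk?link=something'))        # google.co.uk
print(host_from_url('https://test.pythonhosted.org/Flask-Mail/'))  # test.pythonhosted.org
```

Unlike the regex, this also drops a port or user name if one is present, and it returns None for input without a "//" network location.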

There is a library called tldextract which is very reliable in this case.

Here is how it works:

import tldextract

def extractDomain(url):
    if "http" in str(url) or "www" in str(url):
        parsed = tldextract.extract(url)
        # Join the non-empty parts: subdomain, registered domain, public suffix.
        parts = [parsed.subdomain, parsed.domain, parsed.suffix]
        return ".".join(part for part in parts if part)
    return "NA"

# To process a file of URLs (one per line) instead:
# with open("test.txt") as ptr, open("out.txt", "w") as op:
#     for line in ptr.read().split("\n"):
#         op.write(str(extractDomain(line)) + "\n")

print(extractDomain("https://test.pythonhosted.org/Flask-Mail/"))

The output is as follows:

test.pythonhosted.org
