Extract domain name from URL using Python's re regex
I want to input a URL and extract the domain name, which is the string that comes after http:// or https:// and contains letters, numbers, dots, underscores, or dashes.
I wrote the regex and used Python's re module as follows:
import re
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)
My understanding is that m.group(1) will extract the part between () in the re.search pattern.
The output that I expect is:
google.co.uk
But I am getting this:
<_sre.SRE_Match object; span=(0, 35), match='https://google.co.uk?link=something'>
Can you point out how to use re to achieve my requirement?
You need to write:
print(m.group(1))
Better yet, check that the match succeeded first:
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:
    print(m.group(1))
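The same idea can be wrapped in a small helper, sketched below. The names DOMAIN_RE and extract_domain are made up for illustration, and the trailing .* from the original pattern is dropped on the assumption that only the captured group is needed. It also shows why print(m) did not give the expected result: the match object represents the whole match, while group(1) is just the parenthesised part.

```python
import re

# Hypothetical helper; the character class is the one from the question.
DOMAIN_RE = re.compile(r'https?://([A-Za-z_0-9.-]+)')

def extract_domain(url):
    """Return the captured host part, or None if the URL does not match."""
    m = DOMAIN_RE.search(url)
    if m is None:
        return None
    # m.group(0) is the entire match; m.group(1) is the first (...) group.
    return m.group(1)

print(extract_domain('https://google.co.uk?link=something'))  # google.co.uk
print(extract_domain('not a url'))  # None
```

Returning None for a non-matching input keeps the caller from hitting an AttributeError on a failed search.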
Jan has already provided a solution for this. But just to note, we can implement the same without using re. All it needs is the set of punctuation characters, !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, for validation purposes. The same can be obtained from the string module as string.punctuation.
import string

def domain_finder(link):
    # Assumes the host has exactly three dot-separated parts, e.g. google.co.uk.
    dot_splitter = link.split('.')

    # Skip past the scheme ("https://") in the first part, if present.
    seperator_first = 0
    if '//' in dot_splitter[0]:
        seperator_first = (dot_splitter[0].find('//') + 2)

    # Find the first punctuation character that ends the last part of the host.
    seperator_end = ''
    for i in dot_splitter[2]:
        if i in string.punctuation:
            seperator_end = i
            break

    if seperator_end:
        end_ = dot_splitter[2].split(seperator_end)[0]
    else:
        end_ = dot_splitter[2]

    domain = [dot_splitter[0][seperator_first:], dot_splitter[1], end_]
    domain = '.'.join(domain)

    return domain

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain)  # prints ==> google.co.uk
This is just another way of solving the same problem without re.
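For comparison, and not part of the answer above, the standard library's urllib.parse is another way to avoid re without assuming a fixed number of dot-separated parts. A minimal sketch, where the function name domain_from_url is made up for illustration:

```python
from urllib.parse import urlsplit

def domain_from_url(url):
    """Return the host part of a URL, or None if the URL has no host."""
    # urlsplit ends the network location at the first '/', '?' or '#';
    # .hostname additionally drops any port or user info and lowercases it.
    return urlsplit(url).hostname

print(domain_from_url('https://google.co.uk?link=something'))        # google.co.uk
print(domain_from_url('https://test.pythonhosted.org/Flask-Mail/'))  # test.pythonhosted.org
```

Note that a string without a scheme, such as 'google.co.uk', is treated as a path here, so the function returns None for it.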
There is a library called tldextract which is very reliable in this case.
Here is how it works:
import tldextract

def extractDomain(url):
    if "http" in str(url) or "www" in str(url):
        parsed = tldextract.extract(url)
        # Join the non-empty parts: subdomain, domain, and suffix.
        parts = [parsed.subdomain, parsed.domain, parsed.suffix]
        return ".".join(part for part in parts if part)
    else:
        return "NA"

# Optional batch mode: read URLs from test.txt and write the domains to out.txt.
# with open("test.txt") as ptr, open("out.txt", "w") as op:
#     for lines in ptr.read().split("\n"):
#         op.write(str(extractDomain(lines)) + "\n")

print(extractDomain("https://test.pythonhosted.org/Flask-Mail/"))
The output is as follows:
test.pythonhosted.org