How do I extract the domain name?

I am trying to extract the domain name from various website URLs. Here are the inputs and the expected results:

1. "www.xakep.ru"  should equal "xakep"

2. "http://www.fk3vmxex20vzn4ddp.info/default.html" should equal "fk3vmxex20vzn4ddp"

3. "https://hxin2wz7bkx9oicndd28y6m6i7n.us/img/" should equal "hxin2wz7bkx9oicndd28y6m6i7n"

4. "iccan.org" should equal "iccan"

5. "0iwb0awri.br/warez/" should equal "0iwb0awri"

6. "http://www.google.com/" should equal "google"

My code:

import re
url = "www.xakep.ru"
regex = re.compile(r'(://|www.)+([a-zA-Z-_0-9]+)')
match = regex.search(url)
print(match.group(2))

I am having problems with strings that contain neither http nor www.

You may use this regex with 2 optional matches:

^(?:https?://)?(?:www\.)?([^.]+)


RegEx Details:

  • ^ : Start of the string
  • (?:https?://)? : optionally match http:// or https://
  • (?:www\.)? : optionally match www.
  • ([^.]+) : match 1+ of any character that is not a dot, in capture group #1
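As a quick check, here is a minimal sketch that applies the pattern above to the six inputs from the question. The helper name extract_label is only an illustration and is not part of the original answer.

```python
import re

# Pattern from the answer: optional scheme, optional "www.", then capture
# everything up to the first dot in group 1.
DOMAIN_RE = re.compile(r'^(?:https?://)?(?:www\.)?([^.]+)')

def extract_label(url):
    """Return the text captured by group 1, or None if nothing matches."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None

urls = [
    "www.xakep.ru",
    "http://www.fk3vmxex20vzn4ddp.info/default.html",
    "https://hxin2wz7bkx9oicndd28y6m6i7n.us/img/",
    "iccan.org",
    "0iwb0awri.br/warez/",
    "http://www.google.com/",
]

for url in urls:
    print(url, "->", extract_label(url))
```

Note that the capture simply stops at the first dot, so an input with a subdomain other than www (for example blog.example.com) would yield the subdomain rather than the registered domain.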

I know you asked for a regex, but I normally would not recommend doing this kind of parsing "manually", because it is easy to get wrong.

The function you are looking for is in Python's urllib package and should provide everything you need: https://docs.python.org/3/library/urllib.parse.html

Once you have the hostname from the urlsplit function, getting the domain name from it is much easier than trying to parse an arbitrary URL yourself. But then, I might just be lazy here.
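A rough sketch of that approach follows, under a few assumptions. The helper name domain_label is hypothetical. urlsplit only recognises a host when the string contains a "//" authority marker, so inputs without a scheme are prefixed with "//" first. Taking the first label after stripping "www." is a simplification that matches the six examples in the question but does not handle other subdomains or multi-part suffixes such as co.uk.

```python
from urllib.parse import urlsplit

def domain_label(url):
    """Return the first host label of url, ignoring a leading 'www.'."""
    # Without "//", urlsplit treats "www.xakep.ru" as a path, not a host,
    # so add the authority marker for scheme-less inputs.
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    # Assumption: a leading "www." is never the part we want.
    if host.startswith("www."):
        host = host[len("www."):]
    # Simplification: treat the first remaining label as the domain name.
    return host.split(".")[0]

for url in ["www.xakep.ru", "iccan.org", "0iwb0awri.br/warez/",
            "http://www.google.com/"]:
    print(url, "->", domain_label(url))
```

Compared with the regex, this delegates the handling of schemes, ports, paths and credentials to the standard library, leaving only the hostname to be interpreted.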
