Get domain from a string? - Python

Question

I need help. How do I get the domain from a string?

For example: "Hi im Natsume, check out my site http://www.mysite.com/"

How do I get just mysite.com?

Output example:

http://www.mysite.com/ (if http is entered)

www.mysite.com (if http is not entered)

mysite.com (if neither http nor www is entered)

Answer 1

Well ... you first need to define what counts as something that has a "domain". One approach is to look up a regular expression for URL matching and apply it to the string. If it matches, you at least know that the string holds a URL, and you can go on to parse that URL for its host name, from which you can then (possibly) extract the domain.
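The steps described above can be sketched as follows. This is only an illustration under stated assumptions: the pattern `URL_RE` and the helper names `find_host` and `strip_www` are made up for this example, and a "URL" is assumed to be any token that starts with `http://`, `https://` or `www.`. The host name is read with the standard library's `urllib.parse.urlsplit`.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately loose pattern: a token starting with
# "http://", "https://" or "www." and running up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def find_host(text):
    """Return the host name of the first URL-like token in text, or None."""
    match = URL_RE.search(text)
    if match is None:
        return None
    url = match.group(0)
    # urlsplit only recognises a host when a scheme or a leading "//" is present.
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname


def strip_www(host):
    """Drop a leading "www." label, if any."""
    return host[4:] if host.startswith("www.") else host


host = find_host("Hi im Natsume, check out my site http://www.mysite.com/")
print(host)             # www.mysite.com
print(strip_www(host))  # mysite.com
```

Stripping `www.` is only a rough stand-in for "the domain". Hosts such as `shop.example.co.uk` need knowledge of public suffixes, which is what a library like tldextract (see Answer 6) provides.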

Answer 2

>>> import re
>>> myString = "Hi im Natsume, check out my site http://www.mysite.com/"
>>> a = re.search(r"(?P<url>https?://[^\s]+)", myString) or re.search(r"(?P<url>www[^\s]+)", myString)
>>> a.group("url")
'http://www.mysite.com/'
>>> myString = "Hi im Natsume, check out my site www.mysite.com/"
>>> a = re.search(r"(?P<url>https?://[^\s]+)", myString) or re.search(r"(?P<url>www[^\s]+)", myString)
>>> a.group("url")
'www.mysite.com/'

Answer 3

If all the sites had the same format, you could use a regexp like this (which works in this specific case):

re.findall(r'http://www\.(\w+)\.com', url)

However, you would need a more complex regexp to parse arbitrary URLs and extract the domain name.
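As a rough idea of what a somewhat more general pattern could look like, here is a sketch that makes the scheme and the `www.` prefix optional and captures the remaining dot-separated labels. The name `HOST_RE` and the helper `find_domains` are assumptions made for this example. The pattern is far from a full URL grammar and will also match things like file names (`notes.txt`).

```python
import re

# Optional scheme, optional "www.", then one or more "label." parts
# followed by an alphabetic final label of at least two letters.
HOST_RE = re.compile(
    r"(?:https?://)?(?:www\.)?((?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,})"
)


def find_domains(text):
    """Return every captured host (without a leading "www.") found in text."""
    return HOST_RE.findall(text)


print(find_domains("Hi im Natsume, check out my site http://www.mysite.com/"))
print(find_domains("check out my site www.mysite.com"))
print(find_domains("check out my site mysite.com"))
```

All three calls print `['mysite.com']`, which covers the three input forms listed in the question.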

Answer 4

If you want to use a regular expression, one way could be:

>>> import re
>>> s = "Hi im Natsume, check out my site http://www.mysite.com/"
>>> re.findall(r'http://www\.([a-zA-Z0-9._-]*)/', s)
['mysite.com']

This assumes the URL ends with '/'.

Answer 5

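s = "Hi im Natsume, check out my site http://www.mysite.com/"
# Assumes the URL starts with "http://www." or "https://www."
prefix = "http://www." if "http://www." in s else "https://www."
start = s.find(prefix) + len(prefix)
# The domain ends at the next "/" or space, or at the end of the string
candidates = [s.find("/", start), s.find(" ", start), len(s)]
end = min(i for i in candidates if i != -1)
print(s[start:end])

Output: mysite.com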

Answer 6

The best way is to use a regex to extract the URL, then use tldextract to get a valid domain name from the URL.

import re
import tldextract

text = "Hi im Natsume, check out my site http://www.example.com/"
# Find every http(s) URL in the text
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
if urls:  # avoid an IndexError when the text contains no URL
    found_url = urls[0]
    # tldextract splits the host into subdomain, domain and suffix
    info = tldextract.extract(found_url)
    domain_name = info.domain
    suffix_name = info.suffix
    final_domain_name = domain_name + "." + suffix_name
    print(final_domain_name)

Answer 7

How about this?

url = 'https://www.google.com/'
var = url.split('//www.')[1]
domain = var[0:var.index('/')]
print(domain)

7 answers

Answer 1
1 2012-06-27 12:58:48

Answer 2
1 Accepted 2012-06-27 13:00:49

Answer 3
1 2012-06-27 13:01:17

Answer 4
1 2012-06-27 13:03:09

Answer 5
1 2012-06-27 13:07:20

Answer 6
0 2020-10-13 06:56:11

Answer 7
-1 2022-10-03 10:39:27
