How to get the domain name (name + TLD) from a URL in Python

Question

I want to extract the domain name (name of the site + TLD) from a list of URLs that may vary in their format. For instance (current state ----> what I want):

mail.yahoo.com------> yahoo.com
account.hotmail.co.uk---->hotmail.co.uk
x.it--->x.it
google.mail.com---> google.com

Is there any Python code that can help me extract what I want from a URL, or should I do it manually?

Answer 1

This is somewhat non-trivial, as there is no simple rule to determine what makes for a valid public suffix (the part such as com or co.uk under which site names are registered). Instead, the set of public suffixes is maintained as a list at PublicSuffix.org.

A Python package exists that queries that list (stored locally); it's called publicsuffix:

>>> from publicsuffix import PublicSuffixList
>>> psl = PublicSuffixList()  # uses the locally stored copy of the list
>>> print(psl.get_public_suffix('mail.yahoo.com'))
yahoo.com
>>> print(psl.get_public_suffix('account.hotmail.co.uk'))
hotmail.co.uk
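
To make the idea of "longest matching suffix plus one label" concrete, here is a minimal sketch that does not use the publicsuffix package. The names SAMPLE_SUFFIXES and registered_domain are made up for illustration, the suffix set is a tiny hand-picked subset, and the wildcard and exception rules of the real list are ignored.

```python
# Tiny illustrative subset; the real list at PublicSuffix.org is far larger
# and also contains wildcard and exception rules that this sketch ignores.
SAMPLE_SUFFIXES = frozenset({"com", "it", "uk", "co.uk"})


def registered_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one more label, or None."""
    labels = hostname.lower().rstrip(".").split(".")
    # Check candidates from longest to shortest so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the whole hostname is itself a suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


for host in ("mail.yahoo.com", "account.hotmail.co.uk", "x.it"):
    print(host, "->", registered_domain(host))
```

With this rule, google.mail.com maps to mail.com, because com is the suffix and mail is the label directly in front of it.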

Answer 2

There is a publicly maintained list of TLDs and ccTLDs.

This Python project reads that list and compares your URL against it:

https://github.com/john-kurkowski/tldextract
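
Before a URL can be compared against any suffix list, the hostname has to be isolated from the scheme, port, path and so on. The following is a rough standard-library sketch of that first step, not tldextract's own code; hostname_of is a hypothetical helper name, and the scheme check is a simplification that assumes inputs are either full URLs or bare hostnames with an optional path.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "ftp://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def hostname_of(url):
    """Return the lowercased hostname of a URL, tolerating a missing scheme."""
    url = url.strip()
    # urlsplit only recognises the host when "//" precedes it, so add it if absent.
    if not _SCHEME.match(url) and not url.startswith("//"):
        url = "//" + url
    return urlsplit(url).hostname


print(hostname_of("http://www.google.co.uk/some-page/some-sub-page/"))
print(hostname_of("mail.yahoo.com"))
```

The hostname returned here is what would then be matched against the list, whichever package performs the lookup.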

Answer 3

Using the Python tld package:

https://pypi.python.org/pypi/tld

$ pip install tld

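from tld import get_tld
# Output shown is from the tld release this answer was written against.
# Newer releases may return only the suffix here and offer get_fld for the
# full domain; check the installed version.
print(get_tld("http://www.google.co.uk/some-page/some-sub-page/"))
# google.co.uk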

Answer 4

At this time I see six packages doing domain name splitting:

They differ in the way they cache Public Suffix List data (only tldextract uses a JSON file, which avoids parsing the list on load), in the strategy used to download that data, and in the structure they keep in memory (respectively: frozenset, set, set, dictionaries of labels, ditto, dictionary of names), which determines the search algorithm.
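
As a rough illustration of how the in-memory structure shapes the search, the sketch below builds a "dictionaries of labels" structure and walks it from the rightmost label inward. It is not taken from any of the packages above; SAMPLE_RULES, build_label_tree, suffix_length and registered_domain_from_tree are hypothetical names, and wildcard and exception rules are left out.

```python
SAMPLE_RULES = ["com", "it", "uk", "co.uk"]  # illustrative subset, not the real list
END = object()  # sentinel key marking that a complete rule ends at this node


def build_label_tree(rules):
    """Nest dictionaries by label, right to left: 'co.uk' becomes tree['uk']['co']."""
    tree = {}
    for rule in rules:
        node = tree
        for label in reversed(rule.split(".")):
            node = node.setdefault(label, {})
        node[END] = True
    return tree


def suffix_length(hostname, tree):
    """Return how many trailing labels form the longest matching rule (0 if none)."""
    node, matched = tree, 0
    for depth, label in enumerate(reversed(hostname.lower().split(".")), start=1):
        node = node.get(label)
        if node is None:
            break
        if END in node:
            matched = depth
    return matched


def registered_domain_from_tree(hostname, tree):
    """Return the matched suffix plus one more label, or None."""
    labels = hostname.lower().split(".")
    n = suffix_length(hostname, tree)
    if n == 0 or n >= len(labels):
        return None
    return ".".join(labels[-(n + 1):])


TREE = build_label_tree(SAMPLE_RULES)
print(registered_domain_from_tree("account.hotmail.co.uk", TREE))
```

A set-based structure instead tests each candidate suffix string for membership, so it joins labels once per candidate, while the label tree does one dictionary lookup per label and stops at the first label that has no entry.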

4 answers

Answer 1: score 8, accepted, 2013-03-17 12:50:33
Answer 2: score 2, 2013-03-17 13:00:58
Answer 3: score 0, 2013-12-10 09:07:47
Answer 4: score 0, 2017-10-27 08:06:48
