Retrieve domain name from email address
I am working on a huge email-address dataset in Python and need to retrieve the organization name.
For example, email@organizationName.com is easy to extract, but what about email@info.organizationName.com or even email@organizationName.co.uk?
I need a universal extractor that can handle all of these cases.
If the organization name always comes right before .com or some other single-label ending, this might work:
email_str.split('@')[1].split('.')[-2]
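A quick check shows where this one-liner holds up and where it breaks down. This is only a minimal sketch; the helper name `naive_org` is made up for illustration.

```python
def naive_org(email_str):
    """Return the label just before the last dot of the address's host part."""
    return email_str.split('@')[1].split('.')[-2]

samples = [
    'email@organizationName.com',
    'email@info.organizationName.com',
    'email@organizationName.co.uk',
]

for addr in samples:
    # The first two give 'organizationName'; the last gives 'co',
    # because 'co.uk' is a two-label suffix.
    print(addr, '->', naive_org(addr))
```

So the split handles an extra subdomain, but not a multi-part suffix such as co.uk.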
A regex won't work well here. To do this reliably, you need a library that knows what constitutes a valid suffix.
Otherwise, how would the extractor be able to distinguish email@info.organizationName.com from email@organizationName.co.uk?
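To illustrate why suffix knowledge matters, here is a minimal sketch of a suffix-aware split. The tiny `KNOWN_SUFFIXES` set and the `split_domain` helper are hypothetical stand-ins written for this example, not how any particular library is implemented. A real library relies on a much larger, maintained suffix list.

```python
# Hypothetical, deliberately tiny suffix set; a real list has thousands of entries.
KNOWN_SUFFIXES = {'com', 'uk', 'co.uk'}

def split_domain(email_str):
    """Split the host of an email address into (subdomain, domain, suffix)."""
    host = email_str.rsplit('@', 1)[-1]
    labels = host.split('.')
    # Try the longest candidate suffix first, so 'co.uk' wins over 'uk'.
    for i in range(1, len(labels)):
        if '.'.join(labels[i:]).lower() in KNOWN_SUFFIXES:
            subdomain = '.'.join(labels[:i - 1])
            return subdomain, labels[i - 1], '.'.join(labels[i:])
    # No known suffix matched: treat the last label as the domain.
    return '.'.join(labels[:-1]), labels[-1], ''

for addr in ['email@info.organizationName.com', 'email@organizationName.co.uk']:
    print(split_domain(addr))
```

With the suffix set in hand, both addresses resolve to the domain organizationName, which a position-based split alone cannot do.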
Suffix-aware extraction can be done with tldextract:
Example:
import tldextract

emails = ['email@organizationName.com',
          'email@info.organizationName.com',
          'email@organizationName.co.uk',
          'email@info.organizationName.co.uk',
          ]

for addr in emails:
    # extract() splits the host into subdomain, domain, and suffix
    print(tldextract.extract(addr))
Output:
ExtractResult(subdomain='', domain='organizationName', suffix='com')
ExtractResult(subdomain='info', domain='organizationName', suffix='com')
ExtractResult(subdomain='', domain='organizationName', suffix='co.uk')
ExtractResult(subdomain='info', domain='organizationName', suffix='co.uk')
To access just the domain, use tldextract.extract(addr).domain.