Retrieve domain name from email address

I am working on a huge email-address dataset in Python and need to retrieve the organization name from each address.

For example, email@organizationName.com is easy to extract, but what about email@info.organizationName.com or even email@organizationName.co.uk?

I need a universal extractor that can handle all of these cases.

If the organization name always comes right before .com or whatever other ending is used, this might work:

        email_str.split('@')[1].split('.')[-2]
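As a quick check of where that one-liner holds up, here is a small sketch that runs it over the addresses from the question. The wrapper name `naive_org` is only for illustration. It returns `organizationName` for the first two addresses but `co` for the `.co.uk` one, because it always takes the label just before the last dot.

```python
def naive_org(email_str):
    # Take the label immediately before the last dot of the domain part.
    return email_str.split('@')[1].split('.')[-2]

for addr in ['email@organizationName.com',
             'email@info.organizationName.com',
             'email@organizationName.co.uk']:
    print(addr, '->', naive_org(addr))
```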

A regex won't work well here. To do this reliably, you need a library that knows what constitutes a valid suffix.

Otherwise, how would the extractor be able to distinguish email@info.organizationName.com from email@organizationName.co.uk?
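To illustrate what "knowing the valid suffixes" amounts to, here is a minimal sketch that assumes a tiny hand-written suffix set. `KNOWN_SUFFIXES` and `split_domain` are hypothetical names, and the set is nowhere near complete: the real Public Suffix List has thousands of entries plus wildcard and exception rules, which this sketch ignores.

```python
# Hypothetical, deliberately tiny stand-in for the Public Suffix List.
KNOWN_SUFFIXES = {"com", "uk", "co.uk"}

def split_domain(email_str):
    """Return (subdomain, domain, suffix) for the host part of an email address."""
    labels = email_str.rsplit("@", 1)[-1].lower().split(".")
    # Scan from the longest candidate suffix to the shortest,
    # so that 'co.uk' is matched before 'uk'.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            if i == 0:
                # The whole host is itself a suffix; there is no domain label.
                return "", "", candidate
            return ".".join(labels[:i - 1]), labels[i - 1], candidate
    # No known suffix: fall back to treating the last label as the domain.
    return ".".join(labels[:-1]), labels[-1], ""

print(split_domain("email@info.organizationName.com"))
print(split_domain("email@organizationName.co.uk"))
```

Without the suffix set, both addresses look like three dot-separated labels, and nothing in the string itself says which label is the organization.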

This can be done using tldextract:

Example:

import tldextract

emails = ['email@organizationName.com', 
          'email@info.organizationName.com', 
          'email@organizationName.co.uk',
          'email@info.organizationName.co.uk',
         ]

for addr in emails:
    print(tldextract.extract(addr))

Output:

ExtractResult(subdomain='', domain='organizationName', suffix='com')
ExtractResult(subdomain='info', domain='organizationName', suffix='com')
ExtractResult(subdomain='', domain='organizationName', suffix='co.uk')
ExtractResult(subdomain='info', domain='organizationName', suffix='co.uk')

To access just the domain, use tldextract.extract(addr).domain.
