Extract and count domains from a list of email addresses

Question

I have a list of email addresses and would like to extract only the domains and count how many times each one appears:

Emails:

best@yahoo.com

hello@gmail.com

everybody@gmail.com

bye@gmail.com

day@yahoo.com

table.blue@gmail.com

life@yahoo.com

Script:

import re
from collections import Counter

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')

    for line in texte:
        newline = re.search("@[\w.]+", line)
        newmail = newline.group()

        mails_value = Counter(newmail).most_common()

        print (mails_value)

Output:

[('@', 1), ('g', 1), ('6', 1), ('5', 1), ('.', 1), ('f', 1), ('r', 1)]

Traceback (most recent call last):
  File "counting.py", line 10, in <module>
    newmail = newline.group()
AttributeError: 'NoneType' object has no attribute 'group'

Expected output:

@yahoo.com 3

@gmail.com 4

Answer 1

You're pretty close. There is no need to split the file into lines; just use re.findall with re.MULTILINE and the pattern @(.*)$

import re
import collections

with open("mails.txt") as f:
    text = f.read()
domains = re.findall(r'@(.*)$', text, re.MULTILINE)
mails_value = collections.Counter(domains) 
# outputs with example: Counter({'gmail.com': 4, 'yahoo.com': 3})
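
To try the same idea without a file, the call can be run on an in-memory string. This is only a sketch: the `sample` string is built from the addresses in the question, and the pattern `@([^\s@]+)\s*$` is an assumed variant of `@(.*)$` that avoids capturing trailing spaces or `\r` characters.

```python
import re
from collections import Counter

# Sample data copied from the question (stands in for mails.txt)
sample = """best@yahoo.com
hello@gmail.com
everybody@gmail.com
bye@gmail.com
day@yahoo.com
table.blue@gmail.com
life@yahoo.com
"""

# With re.MULTILINE, $ matches at the end of every line,
# not only at the end of the whole string.
domains = re.findall(r'@([^\s@]+)\s*$', sample, re.MULTILINE)
counts = Counter(domains)

# Print in the "@domain count" layout asked for in the question
for domain, n in counts.most_common():
    print(f"@{domain} {n}")
```

most_common() sorts by count, so the most frequent domain is printed first.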

Answer 2

You don't need a regex. If you can trust that all the inputs are well-formed email addresses, this should suffice:

from collections import defaultdict

domain_count = defaultdict(int)

with open("mails.txt", "r") as f:
    for line in f:
        line = line.strip()  # drop the trailing newline so it is not counted as part of the domain
        if line:  # skip blank lines
            domain = line.split('@')[-1]
            domain_count[domain] += 1

print(dict(domain_count))
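
If the input cannot be fully trusted, the split approach can be made a little more defensive. The following is a rough sketch, not part of the original answer: `count_domains` is a hypothetical helper name, and lowercasing the domain is an assumption about how duplicates should be merged.

```python
from collections import Counter

def count_domains(lines):
    """Count domains in an iterable of address strings (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        address = line.strip()
        # rpartition splits on the last '@'; sep is '' when no '@' is present
        _, sep, domain = address.rpartition('@')
        if sep and domain:
            counts[domain.lower()] += 1  # assumption: domains compare case-insensitively
    return counts

lines = ["life@yahoo.com\n", "Bye@Gmail.com\n", "\n", "not-an-address", "hello@gmail.com"]
print(count_domains(lines))
```

Blank lines and strings without an @ are skipped rather than counted as domains.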

Answer 3

The regex will save you from creating an unnecessary list.

import re
from collections import Counter

# Compile once: match the text between "@" and the first "." that follows it
p = re.compile(r"(?<=@)[^.]+(?=\.)")

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')
    l = []
    for line in texte:
        newline = p.search(line)
        if newline:
            newmail = newline.group(0)
            l.append(newmail)

print(Counter(l))

Output:

Counter({'gmail': 4, 'yahoo': 3})
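
Note that the lookbehind/lookahead pattern in this answer returns only the first label of the domain ("gmail"), not the full "gmail.com" the question asks for. A small comparison, where `full_domain` is an assumed variant and not taken from the answer:

```python
import re

# Pattern from the answer: text after "@" up to (not including) the first "."
first_label = re.compile(r"(?<=@)[^.]+(?=\.)")
# Assumed variant: everything after "@" up to whitespace or another "@"
full_domain = re.compile(r"(?<=@)[^\s@]+")

address = "table.blue@gmail.com"
print(first_label.search(address).group(0))  # gmail
print(full_domain.search(address).group(0))  # gmail.com
```

Which one fits depends on whether "yahoo.com" and, say, "yahoo.fr" should be counted together or separately.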

Answer 4

You can use split:

texte = "life@yahoo.com"
texte.split("@")
['life', 'yahoo.com']

Answer 5

Do two splits: the first on newlines, the second on @. Then append the last item of each split to a list and apply Counter to that list.

from collections import Counter

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')

    domains = []

    for line in texte:
        line = line.split('@')
        if line[-1] != "":
            domains.append(line[-1])

mails_value = Counter(domains).most_common()

print(mails_value)

[('gmail.com', 4), ('yahoo.com', 3)]

Answer 6

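import re
from collections import Counter

mails = []

with open("mails.txt", "r") as f:
    texte = f.read().split()  # split on any whitespace, which also drops blank lines
    for i in texte:
        match = re.search(r"@[\w.]+", i)
        if match:  # skip tokens that contain no "@domain" part
            mails.append(match.group())

mails_value = Counter(mails).most_common()
print(mails_value)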

6 answers

Answer 1: 2, accepted, 2018-07-06 12:57:27

Answer 2: 2, 2018-07-06 12:58:37

Answer 3: 2, 2018-07-06 12:59:52

Answer 4: 1, 2018-07-06 12:57:18

Answer 5: 1, 2018-07-06 12:57:30

Answer 6: 1, 2018-07-06 13:02:38
