Extract and count domain names from e-mail addresses
I have a list of e-mail addresses and want to extract only the domain and count how many times each domain appears:
E-mails:
best@yahoo.com
hello@gmail.com
everyone@gmail.com
bye@gmail.com
day@yahoo.com
table.blue@gmail.com
life@yahoo.com
Script:
import re
from collections import Counter

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')
    for line in texte:
        newline = re.search("@[\w.]+", line)
        newmail = newline.group()
        mails_value = Counter(newmail).most_common()
        print (mails_value)
Output:
[('@', 1), ('g', 1), ('6', 1), ('5', 1), ('.', 1), ('f', 1), ('r', 1)]
Traceback (most recent call last):
  File "counting.py", line 10, in <module>
    newmail = newline.group()
AttributeError: 'NoneType' object has no attribute 'group'
Desired output:
@yahoo.com 3
@gmail.com 4
You're very close. There's no need to split the file into lines; just use re.findall with re.MULTILINE and the pattern @(.*)$:
import re
import collections

with open("mails.txt") as f:
    text = f.read()

domains = re.findall(r'@(.*)$', text, re.MULTILINE)
mails_value = collections.Counter(domains)
# outputs with example: Counter({'gmail.com': 4, 'yahoo.com': 3})
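To see how re.MULTILINE changes the meaning of $, here is a minimal sketch that runs the same pattern on an in-memory string instead of a file. The SAMPLE string and the count_domains helper are assumptions made for this demo, not part of the answer above; the loop at the end prints the counts in the format the question asked for.

```python
import re
from collections import Counter

# Sample data standing in for the contents of mails.txt (an assumption for this demo).
SAMPLE = """best@yahoo.com
hello@gmail.com
everyone@gmail.com
bye@gmail.com
day@yahoo.com
table.blue@gmail.com
life@yahoo.com
"""

def count_domains(text):
    """Count whatever follows '@' on each line of `text` (hypothetical helper)."""
    # With re.MULTILINE, $ matches at the end of every line,
    # not only at the end of the whole string.
    return Counter(re.findall(r'@(.*)$', text, re.MULTILINE))

counts = count_domains(SAMPLE)
for domain, n in counts.most_common():
    print(f"@{domain} {n}")
```

Without the re.MULTILINE flag, $ would only match at the very end of the text, so at most one domain would be found.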
You don't need a regular expression. If you can be confident that every input line is a well-formed e-mail address, this is enough:
from collections import defaultdict

domain_count = defaultdict(int)
with open("mails.txt", "r") as f:
    texte = f.readlines()
for line in texte:
    line = line.strip()  # readlines() keeps the trailing newline; remove it
    if line:  # skip blank lines
        domain = line.split('@')[-1]
        domain_count[domain] += 1
print(dict(domain_count))
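One detail worth checking with readlines(): every line keeps its trailing newline, so a domain on the last line of a file that does not end with a newline would be counted under a different key from the same domain elsewhere. The sketch below illustrates this, using io.StringIO as a stand-in for the opened mails.txt (an assumption for the demo).

```python
import io
from collections import Counter

# io.StringIO stands in for the opened file; note the last line has no trailing newline.
fake_file = io.StringIO("best@yahoo.com\nhello@gmail.com\nlife@yahoo.com")

# Splitting the raw lines keeps '\n' inside the domain strings.
raw = Counter(line.split('@')[-1] for line in fake_file.readlines())

# Stripping each line first merges 'yahoo.com\n' and 'yahoo.com' into one key.
fake_file.seek(0)
clean = Counter(
    line.strip().split('@')[-1] for line in fake_file.readlines() if line.strip()
)

print(raw)
print(clean)
```

Here raw ends up with three separate keys, while clean groups the two yahoo.com addresses together.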
A regular expression can avoid creating unnecessary lists:
import re
from collections import Counter

# Compile once, outside the loop: the text between '@' and the first dot after it.
p = re.compile(r"(?<=@)[^.]+(?=\.)")

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')

l = []
for line in texte:
    newline = p.search(line)
    if newline:  # blank lines produce no match
        newmail = newline.group(0)
        l.append(newmail)
print(Counter(l))
Output:
Counter({'gmail': 4, 'yahoo': 3})
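Note that this pattern keeps only the label between @ and the first dot, which is why the output shows gmail rather than gmail.com. The sketch below compares it with an assumed alternative pattern that keeps everything after @; the second address is made up purely for illustration.

```python
import re

# Pattern from the answer: text between '@' and the first dot that follows it.
first_label = re.compile(r"(?<=@)[^.]+(?=\.)")
# Assumed alternative for comparison: every non-whitespace character after '@'.
full_domain = re.compile(r"(?<=@)\S+")

# "user@mail.example.org" is a made-up address used only for illustration.
addresses = ["life@yahoo.com", "user@mail.example.org"]

labels = [first_label.search(a).group(0) for a in addresses]
domains = [full_domain.search(a).group(0) for a in addresses]

print(labels)   # ['yahoo', 'mail']
print(domains)  # ['yahoo.com', 'mail.example.org']
```

Which of the two is appropriate depends on whether addresses under different top-level domains or subdomains should be counted together.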
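You can use split:
>>> texte = "life@yahoo.com"
>>> texte.split("@")
['life', 'yahoo.com']
Do two splits: the first on newlines, the second on @. Then append the last item of each split to a list and apply Counter to that list: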
from collections import Counter

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')

domains = []
for line in texte:
    line = line.split('@')
    if line[-1] != "":  # skip the empty string left by a trailing newline
        domains.append(line[-1])
mails_value = Counter(domains).most_common()
print(mails_value)
# [('gmail.com', 4), ('yahoo.com', 3)]
import re
from collections import Counter

mails = []
with open("mails.txt", "r") as f:
    # split() with no argument splits on any whitespace, so no empty strings are produced
    texte = f.read().split()
for i in texte:
    mails.append(re.search(r"@[\w.]+", i).group())
mails_value = Counter(mails).most_common()
print(mails_value)
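The AttributeError in the question comes from calling .group() on the None that re.search returns when a line contains no match, for example the empty string left after the final newline by split('\n'). A minimal sketch of guarding against that, assuming a hypothetical helper count_at_domains and sample lines invented for this demo:

```python
import re
from collections import Counter

DOMAIN = re.compile(r"@[\w.]+")  # same pattern as in the question

def count_at_domains(lines):
    """Count '@domain' matches, skipping lines without one (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = DOMAIN.search(line)
        if match is None:  # blank line or text without '@'
            continue
        counts[match.group()] += 1
    return counts

# Sample input with a blank line and a non-address line (assumed data).
lines = "best@yahoo.com\nhello@gmail.com\n\nnot an address\nlife@yahoo.com\n".split('\n')
result = count_at_domains(lines)
for domain, n in result.most_common():
    print(domain, n)
```

The Counter is also built once after scanning all lines, rather than from a single match string inside the loop, which is what produced the per-character counts in the question's output.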