How to count occurrences of dictionary key and sum values and print them?
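I am calculating the bandwidth used per domain, and I also need to know how many times each domain was hit. I can calculate the bandwidth, but I am not sure how to count the occurrences of each domain in the logs. Any ideas would be a great help. Thanks in advance.
Code: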
import os
import re
from collections import defaultdict
import string

merged_logs = []
line = []
dict = defaultdict(int)
bandwidth = 0
path = ["/var/logs/"]

# Read every file in each log directory into one list of lines
for i in path:
    for filename in os.listdir(i):
        with open(os.path.join(i, filename), 'r') as filedata:
            merged_logs += filedata.readlines()

for line in merged_logs:
    line_split = line.split(" ")
    # The domain is the text between "CONNECT " and the first " -"
    start = "CONNECT "
    end = " -"
    domain_str = line[line.find(start) + len(start):line.find(end)]
    if domain_str.find("/") > 0:
        domain_split = domain_str.split("/")
        domain = domain_split[0]
        if len(line_split) == 10:
            bandwidth = line_split[3]
        if len(line_split) == 11:
            bandwidth = line_split[4]
    else:
        domain = domain_str
        if len(line_split) == 10:
            bandwidth = line_split[3]
        if len(line_split) == 11:
            bandwidth = line_split[4]
    if domain not in dict:
        dict[domain] = int(bandwidth)
    else:
        dict[domain] += int(bandwidth)

for key, value in dict.items():
    # Floor division keeps the result an integer
    print(key, (value * 2) // (1024 * 1024))
A sample log file under /var/logs contains the following lines:
1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 1001803 CONNECT www.google.com:443 - HIER_DIRECT/www.google.com - 192.168.100.3
1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 1001085 CONNECT www.google.com:443 - HIER_DIRECT/www.google.com - 192.168.100.3
1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 1000182 CONNECT www.google.com:443 - HIER_DIRECT/www.google.com - 192.168.100.3
1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 1006183 CONNECT www.xyz.com/index.php - HIER_DIRECT/www.xyz.com - 192.168.100.3
1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 1091083 CONNECT www.xyz.com/index.php - HIER_DIRECT/www.xyz.com - 192.168.100.3
1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 2091803 CONNECT www.xyz.com/index.php - HIER_DIRECT/www.xyz.com - 192.168.100.3
1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 2091083 CONNECT www.xyz.com/index.php - HIER_DIRECT/www.xyz.com - 192.168.100.3
59375 192.168.10.3 TAG_NONE/503 10 CONNECT www.google.com - HIER_NONE/- - 192.168.100.3
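To make the field layout of these lines concrete, here is a small sketch that splits two of the sample lines and reports where the byte count sits. The helper name `locate_fields` is hypothetical, and the assumption that the byte count is always the field immediately before `CONNECT` comes only from the sample lines above.

```python
def locate_fields(line):
    """Return (field count, byte-count index, byte count, CONNECT target).

    Assumes the byte count is the field right before "CONNECT", as in the
    sample lines. Raises ValueError if the line has no CONNECT field.
    """
    fields = line.split()
    idx = fields.index("CONNECT")
    return len(fields), idx - 1, int(fields[idx - 1]), fields[idx + 1]


full_line = ("1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 1001803 "
             "CONNECT www.google.com:443 - HIER_DIRECT/www.google.com - 192.168.100.3")
short_line = ("59375 192.168.10.3 TAG_NONE/503 10 "
              "CONNECT www.google.com - HIER_NONE/- - 192.168.100.3")

print(locate_fields(full_line))   # (11, 4, 1001803, 'www.google.com:443')
print(locate_fields(short_line))  # (10, 3, 10, 'www.google.com')
```

The 11-field line carries the byte count at index 4 and the 10-field line at index 3, which is what the length checks in the code rely on. Locating the value relative to `CONNECT` is one possible way to avoid hard-coding both cases.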
The output should be in the following format:
Domain              Bandwidth (MB)  Hit (Count)
www.xyz.com         11              4
www.google.com      5               3
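One way to print columns like these is with fixed-width format fields. This is only a sketch: the helper name `format_table` and the column widths (20 and 16) are arbitrary assumptions, and the rows are hard-coded from the expected output rather than computed from the logs.

```python
def format_table(rows):
    """Format (domain, megabytes, hits) tuples as left-aligned columns."""
    template = "{:<20}{:<16}{}"  # widths are arbitrary, chosen for illustration
    lines = [template.format("Domain", "Bandwidth (MB)", "Hit (Count)")]
    for domain, megabytes, hits in rows:
        lines.append(template.format(domain, megabytes, hits))
    return "\n".join(lines)


# Values copied from the expected output in the question
rows = [("www.xyz.com", 11, 4), ("www.google.com", 5, 3)]
print(format_table(rows))
```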
Answer:
import os
import re
from collections import defaultdict, Counter

# Compile the regex once, up front, so it is not rebuilt for every line
domain_pattern = re.compile(r"(CONNECT )(?P<domain>.*?)( -)")
# defaultdict(int) stores and updates the sum of bandwidths per domain
bandwidths = defaultdict(int)
# Counter stores and updates the number of hits per domain
counts = Counter()
merged_logs = []
path = ["/var/logs/"]

for i in path:
    for filename in os.listdir(i):
        with open(os.path.join(i, filename), 'r') as filedata:
            merged_logs += filedata.readlines()

for line in merged_logs:
    line_split = line.split(" ")
    # search() finds the first match of the pattern; the named group
    # 'domain' holds the text between "CONNECT " and " -"
    match = domain_pattern.search(line)
    if match is None:
        continue  # not a CONNECT line
    domain = match.group('domain').split("/")[0]
    if len(line_split) == 10:
        bandwidth = line_split[3]
    elif len(line_split) == 11:
        bandwidth = line_split[4]
    else:
        continue  # unexpected field count: skip instead of reusing a stale value
    # Add this line's bandwidth to the running total for the domain
    bandwidths[domain] += int(bandwidth)
    # Increment the hit count for the domain by 1
    counts[domain] += 1

for domain in bandwidths:
    bandwidth = int((bandwidths[domain] * 2) / (1024 * 1024))
    hits = counts[domain]
    print(domain, bandwidth, hits)
I ran the above code on the sample logs and got the following output:
www.google.com:443 5 3
www.xyz.com 11 4
www.google.com 0 1
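Note that this output lists www.google.com:443 and www.google.com as separate keys, because the port is part of the matched string, so it does not quite match the requested format. A possible adjustment, sketched here under assumptions, is to strip a trailing `:port` before counting. The function name `summarize` is hypothetical. The question does not say whether the failed request (TAG_NONE/503) should count as a hit; this sketch counts it. The factor of 2 is kept from the code above without further justification.

```python
from collections import Counter, defaultdict


def summarize(lines):
    """Sum byte counts and count hits per host, ignoring any :port suffix."""
    totals = defaultdict(int)
    hits = Counter()
    for line in lines:
        fields = line.split()
        if "CONNECT" not in fields:
            continue
        idx = fields.index("CONNECT")
        if idx == 0 or idx + 1 >= len(fields):
            continue  # no byte count before, or no target after, CONNECT
        try:
            size = int(fields[idx - 1])
        except ValueError:
            continue  # field before CONNECT is not a number
        # Drop any path, then any port. Assumes plain host names
        # (a bare IPv6 address would be cut at its first colon).
        host = fields[idx + 1].split("/")[0].split(":")[0]
        totals[host] += size
        hits[host] += 1
    return totals, hits


sample = [
    "1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 1001803 CONNECT www.google.com:443 - HIER_DIRECT/www.google.com - 192.168.100.3",
    "59375 192.168.10.3 TAG_NONE/503 10 CONNECT www.google.com - HIER_NONE/- - 192.168.100.3",
    "1569935790.563 1010 192.168.10.3 TCP_TUNNEL/200 1006183 CONNECT www.xyz.com/index.php - HIER_DIRECT/www.xyz.com - 192.168.100.3",
]

totals, hits = summarize(sample)
for host in totals:
    # Same conversion as the code above, using floor division
    print(host, (totals[host] * 2) // (1024 * 1024), hits[host])
```

With this normalization both google lines land under the single key www.google.com. If failed requests should not count, filtering on the status field (for example, skipping lines whose result code is not 200) would be a further assumption to check against the real logs.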