How to get unique keys and a list of unique values in a Python dictionary?

Sorry if this question seems similar to previous ones, but I could not find an answer that fits my exact problem.

I have a set of text files in a directory, and I want to read and parse all of them. The files are formatted like this (note that one domain can appear with several ips, one ip can appear with several domains, and the same domain|ip pair can be repeated):

file 1:    domain|ip
    yahoo.com|9.9.9.9
    mard.man.net|23.34.5.1
    bbc.net|86.45.76.5


file 2:
    google.com|9.9.9.9
    yahoo.com|9.9.9.9
    yahoo.com|23.34.5.1

What I want is a dictionary that maps each unique ip to the number of unique domains associated with it, like this:

9.9.9.9,2
23.34.5.1,2
86.45.76.5,1

Here is the script that I wrote for it:

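import fnmatch
import os
from collections import defaultdict

d = defaultdict(set)  # ip -> set of unique domains

# path is the root directory holding the text files (defined elsewhere)
for dirpath, dirs, files in os.walk(path):
    for filename in fnmatch.filter(files, '*.*'):
        with open(os.path.join(dirpath, filename)) as f:
            for line in f:
                # only lines beginning with '.' are parsed, as in the original script
                if line.startswith('.'):
                    domain, ip = line.strip('\n').split('|')[:2]
                    d[ip].add(domain)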

The problem is that the script runs over several text files. If an ip (key) has already been written to the dictionary (d) from one text file and then appears again in another text file, it is written out again with the new value, something like this:

9.9.9.9,1
23.34.5.1,1
86.45.76.5,1
9.9.9.9,2
23.34.5.1,2

I think a better approach would be to link each ip address to the list of domains that use it, rather than capturing only the last domain encountered.

Like:

if ip in d:
    if domain not in d[ip]:
        d[ip].append(domain)
else:
    d[ip] = [domain]

Now you can get the count by using

len(d[ip])

for any given ip.
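A minimal sketch of the idea above, using the two sample files from the question as in-memory lists rather than reading a directory. The names `count_domains` and `files` are made up for illustration. It keeps one dictionary across all files and prints the counts only once, after every file has been processed, so each ip is reported a single time:

```python
def count_domains(files):
    """Map each ip to the list of unique domains seen across all files."""
    d = {}
    for lines in files:
        for line in lines:
            line = line.strip()
            if '|' not in line:
                continue  # skip blank or malformed lines
            domain, ip = line.split('|', 1)
            if ip in d:
                if domain not in d[ip]:
                    d[ip].append(domain)
            else:
                d[ip] = [domain]
    return d


# Hypothetical stand-ins for the two sample files in the question.
files = [
    ["yahoo.com|9.9.9.9", "mard.man.net|23.34.5.1", "bbc.net|86.45.76.5"],
    ["google.com|9.9.9.9", "yahoo.com|9.9.9.9", "yahoo.com|23.34.5.1"],
]

d = count_domains(files)

# Report once, after all files have been read.
for ip, domains in d.items():
    print(f"{ip},{len(domains)}")
```

With a `defaultdict(set)`, as in the question, the membership check is unnecessary because a set silently ignores duplicates.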

Why not use the Counter class from collections? It should be much faster. You could create an empty counter object:

from collections import Counter
c = Counter()

and then update it with the data from each newly read file. If the files are not very big, I would suggest slurping each one at once using the "readlines" method and then processing all of its lines together with a list comprehension.
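One way this could look, as a sketch rather than a definitive implementation. A Counter counts occurrences, so a repeated domain|ip pair would be counted twice if the lines were fed to it directly. This variation therefore collects the unique pairs from all files first and only then counts them per ip. The helper `unique_pairs` and the lists `file1` and `file2` are hypothetical stand-ins for what `readlines` would return on the sample files:

```python
from collections import Counter


def unique_pairs(lines):
    """Return the set of (domain, ip) pairs found in an iterable of lines."""
    return {
        tuple(line.strip().split('|', 1))
        for line in lines
        if '|' in line
    }


# Hypothetical stand-ins for f.readlines() on the two sample files.
file1 = ["yahoo.com|9.9.9.9\n", "mard.man.net|23.34.5.1\n", "bbc.net|86.45.76.5\n"]
file2 = ["google.com|9.9.9.9\n", "yahoo.com|9.9.9.9\n", "yahoo.com|23.34.5.1\n"]

pairs = set()
for lines in (file1, file2):
    pairs |= unique_pairs(lines)  # duplicate pairs across files collapse here

# Each remaining pair is one unique domain for its ip.
c = Counter(ip for _domain, ip in pairs)
```

After this, `c[ip]` gives the number of unique domains for any ip, and an ip that was never seen yields 0.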
