
How to get unique keys and list of unique values in python dictionary?

Sorry if the question seems similar to previous ones but I could not find any relevant answer to my exact problem.

I have a set of text files in a directory and I want to read and parse them all. The files look like this (they contain duplicate IPs for one domain, duplicate domains for one IP, and also repeated domain|ip pairs):

file 1:

    domain|ip
    yahoo.com|9.9.9.9
    mard.man.net|23.34.5.1
    bbc.net|86.45.76.5


file 2:
    google.com|9.9.9.9
    yahoo.com|9.9.9.9
    yahoo.com|23.34.5.1

What I want is a dictionary that maps each unique IP to the number of unique domains associated with it, like below:

9.9.9.9,2
23.34.5.1,2
86.45.76.5,1

Here is the script I wrote for it:

import os
import fnmatch
from collections import defaultdict

d = defaultdict(set)

for dirpath, dirs, files in os.walk(path):
    for filename in fnmatch.filter(files, '*.*'):
        with open(os.path.join(dirpath, filename)) as f:
            for line in f:
                if '|' in line:  # skip the "domain|ip" header and blank lines
                    domain, ip = line.rstrip('\n').split('|')
                    d[ip].add(domain)

But the problem is that, since the script runs over several text files, if an IP (key) has already been written to the dictionary d from one text file and then appears again in another text file, the dictionary writes it again with a new value, producing something like this:

9.9.9.9,1
23.34.5.1,1
86.45.76.5,1
9.9.9.9,2
23.34.5.1,2

I think a better approach would be to link each ip address to the list of domains using it rather than capturing the last domain encountered.

Like:

if ip in d:
    if domain not in d[ip]:
        d[ip].append(domain)
else:
    d[ip] = [domain]

Now you can get the count for any given ip with:

len(d[ip])
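Putting the pieces above together, here is a minimal, self-contained sketch of this approach. The `lines` list stands in for the lines read from the files and reuses the sample data from the question:

```python
# Map each IP to the list of unique domains seen for it,
# then print the count of domains per IP.
lines = [
    "yahoo.com|9.9.9.9",
    "mard.man.net|23.34.5.1",
    "bbc.net|86.45.76.5",
    "google.com|9.9.9.9",
    "yahoo.com|9.9.9.9",
    "yahoo.com|23.34.5.1",
]

d = {}
for line in lines:
    domain, ip = line.strip().split('|')
    if ip in d:
        if domain not in d[ip]:  # only add a domain once per IP
            d[ip].append(domain)
    else:
        d[ip] = [domain]

for ip, domains in d.items():
    print(f"{ip},{len(domains)}")
```

This prints one `ip,count` line per unique IP, matching the desired output format.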

Why not use the Counter class from collections? It should be much faster. You could create an empty counter object:

c = Counter()

and then update it with the data from each newly read file. If the files are not very big, I would suggest slurping each one at once using the readlines method and then processing all lines together with a list comprehension.
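A sketch of the Counter idea, assuming the lines have already been read into a list (the sample data below is taken from the question). Note that a Counter over raw lines would double-count repeated domain|ip pairs, so the pairs are deduplicated with a set first:

```python
from collections import Counter

lines = [
    "yahoo.com|9.9.9.9",
    "mard.man.net|23.34.5.1",
    "bbc.net|86.45.76.5",
    "google.com|9.9.9.9",
    "yahoo.com|9.9.9.9",
    "yahoo.com|23.34.5.1",
]

# Deduplicate (domain, ip) pairs, then count unique domains per IP.
pairs = {tuple(line.strip().split('|')) for line in lines}
c = Counter(ip for domain, ip in pairs)

for ip, count in c.items():
    print(f"{ip},{count}")
```

`c.update(...)` can be called once per file, so the counter accumulates across the whole directory walk.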
