
compute difference between lists in dictionary values

I have some dictionaries that are generated dynamically. They are created with defaultdict (from collections import defaultdict) and look like this:

  a = defaultdict(list, {'speed_limit': [('0', '70')]})
  b = defaultdict(list, {'speed_limit': [('0', '70'), ('0', '60'), ('0', '50')],
                         'road_obstacles': [('0', '8')]})

What I want

  • Print nothing if 'a' is contained in 'b', which is true in the above case. Only print when keys or the values inside them differ.

  • In the above case, a has 1 tuple and b has 3 tuples, but a's tuple is one of b's, so no difference should be reported.

What I tried

I tried a very conservative approach with nested loops, which works but is not efficient. Additionally, it fails when the structures I am comparing get more complex.

This is what I tried, and this approach is very inefficient for large structures:

for key, value in a.iteritems():
    for key1, value1 in b.iteritems():
        if key != key1:
            print "doesn't match", key, value, key1, value1
        else:  # keys match, check the values
            if value == value1:  # values are the same
                print "key and value match", key, value, key1, value1
            else:  # values differ
                print "key matches but value differs", key, value, key1, value1

Currently you're iterating over both dictionaries, essentially generating a Cartesian product. Sounds to me like what you really want is a union.

The union operator is |. It works on sets. To find the union of all the keys in the two dictionaries, use set(a.keys()) | set(b.keys()).

Edit: ivan_pozdeev points out that the sets can be calculated faster using set(a) | set(b).
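A quick sanity check with the a and b above shows the two spellings are equivalent, since iterating a dict yields its keys:

# iterating a dict directly yields its keys, so no intermediate key list is built
assert set(a) | set(b) == set(a.keys()) | set(b.keys())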

You can then just iterate over that set once, checking key in a, key in b, and whether the values have common elements (set(a_value) & set(b_value)) as necessary. Here's an example:

all_keys = set(a.keys()) | set(b.keys())
for k in all_keys:
    if k in a:
        if k in b:
            print("Key is in both dictionaries:",k)
            a_value,b_value = a[k],b[k]
            if set(a_value) & set(b_value):
                print("Values match")
            else:
                print("Values do not match")
        else: print("Key is in a but not b:",k)
    else: print("Key is in b but not a:",k)

That's just one way of doing it. Another way would be to calculate three sets: set(a.keys()) - set(b.keys()) for keys in a but not in b, set(b.keys()) - set(a.keys()) for keys in b but not in a, and set(a.keys()) & set(b.keys()) for the keys that are in both dictionaries.
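A minimal sketch of that three-set variant, using the same a and b and the same kind of reporting as above:

a_keys, b_keys = set(a.keys()), set(b.keys())

for k in a_keys - b_keys:   # keys in a but not in b
    print("Key is in a but not b:", k)
for k in b_keys - a_keys:   # keys in b but not in a
    print("Key is in b but not a:", k)
for k in a_keys & b_keys:   # keys in both: compare the value lists
    if set(a[k]) & set(b[k]):
        print("Key is in both and the values have common elements:", k)
    else:
        print("Key is in both but the values have no common elements:", k)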

For set operations, the fastest approach should be to convert the dicts to sets and use the set type's own operations (as they are implemented in C):

>>> [
    set((k, iv) for k, v in var.iteritems() for iv in v)
    for var in a, b]
[{('speed_limit', ('0', '70'))},
 {('road_obstacles', ('0', '8')),
  ('speed_limit', ('0', '50')),
  ('speed_limit', ('0', '60')),
  ('speed_limit', ('0', '70'))}]
>>> sa,sb=_
>>> sb > sa
True
>>> sb - sa
{('road_obstacles', ('0', '8')),
 ('speed_limit', ('0', '50')),
 ('speed_limit', ('0', '60'))}
>>> sa - sb
set()
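With these two sets, the check asked for in the question (print nothing when a is contained in b, otherwise report the extras) is just a containment test; a minimal sketch:

# sa and sb are the flattened (key, value)-pair sets built above
if not sa <= sb:   # a has pairs that b does not
    print("in a but not in b:", sa - sb)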
  • Pros: all complex operations are in C
  • Cons: an initial step with a loop, although it is optimization-friendly (maybe you're better off storing them as sets from the start? The decision depends on how often you need to perform set-friendly vs dict-friendly operations; a sketch of that option follows the timing notes below)

  • A timeit test on the variables in your example shows the times are roughly the same:

     def loops(): <your code>
     def sets(): <my code up to sa, sb>
     <a and b are being taken from the interactive namespace>

     In [85]: timeit sets
     The slowest run took 14.36 times longer than the fastest. This could mean that an intermediate result is being cached
     1000000 loops, best of 3: 253 ns per loop

     In [86]: timeit loops
     The slowest run took 13.23 times longer than the fastest. This could mean that an intermediate result is being cached
     1000000 loops, best of 3: 253 ns per loop
  • A timeit test on a randomized example with ~1000 elements shows that my code appears to start outperforming yours but the discrepancy is high:

     In [64]: alphabet='abcdefghijklmnopqrstuvwxyz_'
     In [41]: gen_word=lambda:''.join(random.choice(alphabet) for i in range(random.randrange(0,15)))
     In [66]: a={gen_word():[tuple(random.randrange(100) for _ in range(2)) for _ in range(random.randrange(10))] for _ in range(1000)}
     In [67]: b=a.copy()
     In [69]: b.update({gen_word():[tuple(random.randrange(100) for _ in range(2)) for _ in range(random.randrange(10))] for _ in range(500)})
     In [74]: a.update({gen_word():[tuple(random.randrange(100) for _ in range(2)) for _ in range(random.randrange(10))] for _ in range(200)})
     In [70]: len(b)
     Out[70]: 1336
     In [75]: len(a)
     Out[75]: 1067
     In [76]: timeit loops
     The slowest run took 9.94 times longer than the fastest. This could mean that an intermediate result is being cached
     1000000 loops, best of 3: 337 ns per loop
     In [77]: timeit sets
     The slowest run took 14.43 times longer than the fastest. This could mean that an intermediate result is being cached
     1000000 loops, best of 3: 252 ns per loop
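If you do decide to keep the data as flat sets of (key, value) pairs from the start, as floated in the Cons bullet above, the conversion loop disappears entirely. A minimal sketch (the literals just mirror the question's a and b):

# each structure is kept as a flat set of (key, value) pairs instead of a defaultdict(list)
sa = {('speed_limit', ('0', '70'))}
sb = {('speed_limit', ('0', '70')), ('speed_limit', ('0', '60')),
      ('speed_limit', ('0', '50')), ('road_obstacles', ('0', '8'))}

print(sb >= sa)   # True: the containment check needs no conversion step
print(sb - sa)    # the extra pairs come straight from a set operator

The trade-off, as noted above, is that you give up dict-style lookup by key.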


 