Compare Dictionaries with unhashable or uncomparable values? (e.g. Lists or Dataframes)

TL;DR: How can you compare two Python dictionaries if some of their values are unhashable/mutable (e.g. lists or pandas DataFrames)?


I have to compare dictionary pairs for equality. In that sense, this question is similar to these two, but their solutions only seem to work for immutable objects ...

My problem is that I'm dealing with pairs of highly nested dictionaries where the unhashable objects can be found in different places depending on which pair of dictionaries I'm comparing. My thinking is that I'll need to iterate across the deepest values contained in the dictionary and can't just rely on dict.iteritems() (dict.items() in Python 3), which only unrolls the top-level key-value pairs. I'm not sure how to iterate across all the possible key-value pairs contained in the dictionary and compare them, using sets/== for the hashable objects and df1.equals(df2) for pandas dataframes. (Note: for pandas dataframes, df1==df2 does an element-wise comparison and handles NAs poorly; df1.equals(df2) gets around that.)

So for example:

a = {'x': 1, 'y': {'z': "George", 'w': df1}}
b = {'x': 1, 'y': {'z': "George", 'w': df1}}
c = {'x': 1, 'y': {'z': "George", 'w': df2}}

At a minimum, and this would be pretty awesome already, the solution would yield True/False as to whether their values are the same, and would work for pandas dataframes.

def dict_compare(d1, d2):
    if ...:
        return True
    elif ...:
        return False

>>> dict_compare(a, b)
True
>>> dict_compare(a, c)
False

Moderately better: the solution would point out which keys/values differ between the dictionaries.

In the ideal case: the solution could separate the values into 4 groupings:

  • added
  • removed
  • modified
  • same
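
To make the four groupings concrete, here is a minimal sketch of what such a report could look like. It is illustrative only and not taken from any answer: `classify`, `values_equal` and `FakeFrame` are hypothetical names, and `FakeFrame` merely mimics the `equals` method of a pandas DataFrame so the example runs without pandas. It assumes only dictionaries are nested; unhashable values inside lists are not handled.

```python
def values_equal(v1, v2):
    """Compare two leaf values, preferring an `equals` method if one exists."""
    if hasattr(v1, "equals"):  # assumption: DataFrame-like objects expose equals()
        return bool(v1.equals(v2))
    return bool(v1 == v2)


def classify(d1, d2, path=()):
    """Sort every key path of two nested dicts into one of four groups."""
    result = {"added": [], "removed": [], "modified": [], "same": []}
    for key in d1:
        here = path + (key,)
        if key not in d2:
            result["removed"].append(here)
        elif isinstance(d1[key], dict) and isinstance(d2[key], dict):
            # Recurse into nested dictionaries and merge their groups.
            nested = classify(d1[key], d2[key], here)
            for group in result:
                result[group].extend(nested[group])
        elif values_equal(d1[key], d2[key]):
            result["same"].append(here)
        else:
            result["modified"].append(here)
    for key in d2:
        if key not in d1:
            result["added"].append(path + (key,))
    return result


class FakeFrame:
    """Hypothetical stand-in for a DataFrame: it only offers equals()."""

    def __init__(self, rows):
        self.rows = rows

    def equals(self, other):
        return isinstance(other, FakeFrame) and self.rows == other.rows


df1, df2 = FakeFrame([1, 2, 3]), FakeFrame([3, 2, 1])
a = {'x': 1, 'y': {'z': "George", 'w': df1}}
c = {'x': 1, 'y': {'z': "George", 'w': df2}, 'v': [1, 2]}

report = classify(a, c)
# The plain True/False answer falls out of the report.
identical = not (report["added"] or report["removed"] or report["modified"])
```

Here `report["modified"]` holds the path `('y', 'w')` and `report["added"]` holds `('v',)`, so `identical` is False.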

Well, there's a way to make any type comparable: simply wrap it in a class that compares the way you need it to:

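class DataFrameWrapper:
    """Give a DataFrame an == that returns a single bool via DataFrame.equals."""
    def __init__(self, df):
        self.df = df

    def __eq__(self, other):
        # Only compare against other wrappers; let Python handle anything else.
        if not isinstance(other, DataFrameWrapper):
            return NotImplemented
        return self.df.equals(other.df)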

So once you wrap your "uncomparable" values you can simply use ==:

>>> import pandas as pd

>>> df1 = pd.DataFrame({'a': [1,2,3]})
>>> df2 = pd.DataFrame({'a': [3,2,1]})

>>> a = {'x': 1, 'y': {'z': "George", 'w': DataFrameWrapper(df1)}}
>>> b = {'x': 1, 'y': {'z': "George", 'w': DataFrameWrapper(df1)}}
>>> c = {'x': 1, 'y': {'z': "George", 'w': DataFrameWrapper(df2)}}
>>> a == b
True
>>> a == c
False

Of course wrapping your values has its disadvantages, but if you only need to compare them it is a very easy approach. All that may be needed is a recursive wrapping before doing the comparison and a recursive unwrapping afterwards:

def recursivewrap(dict_):
    for key, value in dict_.items():
        wrapper = wrappers.get(type(value), lambda x: x)  # for other types don't wrap
        dict_[key] = wrapper(value)
    return dict_  # return dict_ so this function can be used for recursion

def recursiveunwrap(dict_):
    for key, value in dict_.items():
        unwrapper = unwrappers.get(type(value), lambda x: x)
        dict_[key] = unwrapper(value)
    return dict_

wrappers = {pd.DataFrame: DataFrameWrapper,
            dict: recursivewrap}
unwrappers = {DataFrameWrapper: lambda x: x.df,
              dict: recursiveunwrap}

Sample case:

>>> recursivewrap(a)
{'x': 1,
 'y': {'w': <__main__.DataFrameWrapper at 0x2affddcc048>, 'z': 'George'}}
>>> recursiveunwrap(recursivewrap(a))
{'x': 1, 'y': {'w':    a
  0  1
  1  2
  2  3, 'z': 'George'}}

If you feel really adventurous you could use wrapper classes that, depending on the comparison result, modify some variable that records what wasn't equal.
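
A rough sketch of that idea, under a few assumptions: `RecordingWrapper`, its `label` argument and the shared `log` list are hypothetical names invented for this illustration, and plain lists stand in for the DataFrames so it runs without pandas (for real DataFrames you would pass `equal=lambda x, y: x.equals(y)`).

```python
class RecordingWrapper:
    """Hypothetical wrapper that logs its label whenever a comparison fails."""

    def __init__(self, value, label, log, equal=lambda x, y: x == y):
        self.value = value
        self.label = label  # e.g. the key path, chosen by the caller
        self.log = log      # shared list collecting labels of unequal values
        self.equal = equal  # comparison function returning something truthy/falsy

    def __eq__(self, other):
        if not isinstance(other, RecordingWrapper):
            return NotImplemented
        result = bool(self.equal(self.value, other.value))
        if not result:
            self.log.append(self.label)
        return result


log = []
a = {'x': 1, 'y': {'w': RecordingWrapper([1, 2, 3], 'y.w', log)}}
c = {'x': 1, 'y': {'w': RecordingWrapper([3, 2, 1], 'y.w', log)}}

is_equal = (a == c)  # the nested wrappers compare unequal and record 'y.w'
```

Note that the built-in dictionary comparison stops at the first unequal value, so a single `==` records at most one mismatch; collecting all of them would mean walking the keys yourself.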


This part of the answer was based on the original question, which didn't include nesting:

You can separate the unhashable values from the hashable values, then do a set comparison for the hashable values and an "order-independent" list comparison for the unhashable ones:

def split_hashable_unhashable(vals):
    """Separate hashable values from unhashable ones and return a set (hashables)
    and a list (unhashables)."""
    set_ = set()
    list_ = []
    for val in vals:
        try:
            set_.add(val)
        except TypeError:  # unhashable
            list_.append(val)
    return set_, list_


def compare_lists_arbitary_order(l1, l2, cmp=pd.DataFrame.equals):
    """Compare two lists using a custom comparison function, the order of the
    elements is ignored."""
    # need to have equal lengths otherwise they can't be equal
    if len(l1) != len(l2):  
        return False

    remaining_indices = set(range(len(l2)))
    for item in l1:
        for cmpidx in remaining_indices:
            if cmp(item, l2[cmpidx]):
                remaining_indices.remove(cmpidx)
                break
        else:
            # Run through the loop without finding a match
            return False
    return True

def dict_compare(d1, d2):
    if set(d1) != set(d2):  # compare the dictionary keys
        return False
    set1, list1 = split_hashable_unhashable(d1.values())
    set2, list2 = split_hashable_unhashable(d2.values())
    if set1 != set2:  # set comparison is easy
        return False

    return compare_lists_arbitary_order(list1, list2)

It got a bit longer than expected. For your test cases it definitely works (note that it compares the keys and the values separately, so it ignores which key holds which value):

>>> import pandas as pd

>>> df1 = pd.DataFrame({'a': [1,2,3]})
>>> df2 = pd.DataFrame({'a': [3,2,1]})

>>> a = {'x': 1, 'y': df1}
>>> b = {'y': 1, 'x': df1}
>>> c = {'y': 1, 'x': df2}
>>> dict_compare(a, b)
True
>>> dict_compare(a, c)
False
>>> dict_compare(b, c)
False

The set operations can also be used to find differences (see set.difference). It's a bit more complicated with the lists, but not impossible: one could add the items for which no match was found to a separate list instead of immediately returning False.
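
As a minimal sketch of that last suggestion (the name `unmatched_items` is hypothetical, and `operator.eq` is used as the default comparison so the example runs without pandas; for DataFrames you would pass `cmp=pd.DataFrame.equals` as above):

```python
import operator


def unmatched_items(l1, l2, cmp=operator.eq):
    """Return the items of l1 and of l2 that have no partner in the other list.

    The order of the elements is ignored; each item of l2 can be matched
    at most once.
    """
    remaining = list(range(len(l2)))  # indices of l2 not matched yet
    only_in_l1 = []
    for item in l1:
        for idx in remaining:
            if cmp(item, l2[idx]):
                remaining.remove(idx)  # safe: we leave the loop right away
                break
        else:
            # Ran through the loop without finding a match
            only_in_l1.append(item)
    only_in_l2 = [l2[idx] for idx in remaining]
    return only_in_l1, only_in_l2


left, right = unmatched_items([[1], [2], [3]], [[3], [1], [4]])
# left == [[2]] and right == [[4]]
```

Two lists are then equal in the order-independent sense exactly when both returned lists are empty.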

The DeepDiff library provides extensive support for diffing two Python dictionaries:

https://github.com/seperman/deepdiff

DeepDiff: Deep Difference of dictionaries, iterables, strings and other objects. It will recursively look for all the changes.

pip install deepdiff
