Python: Intersection of two lists of lists

I have a list of lists A and a list of lists B, where A and B have many identical sublists.

What is the best way to get the unique sublists out of B and into A?

A = [['foo', 123], ['bar', np.array(range(10))], ['baz', 345]]
B = [['foo', 123], ['bar', np.array(range(10))], ['meow', 456]]

=> A = [['foo', 123], ['bar', np.array(range(10))], ['baz', 345], ['meow', 456]]

I tried:

A += [b for b in B if b not in A]

But this gives me a ValueError saying to use any() or all(). Do I really have to test element by element for every sublist in B across every sublist in A?

ERROR: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Usually, you would use one of the many ways to uniquify a list or several lists, either preserving order or not.

Here is a way to uniquify two lists that does not maintain order:

>>> A=[1,3,5,'a','c',7]
>>> B=[1,2,3,'c','b','a',6]
>>> set(A+B)
set(['a', 1, 'c', 3, 5, 6, 7, 2, 'b'])

Here is a way that does maintain order:

>>> seen=set()
>>> [e for e in A+B if e not in seen and (seen.add(e) or True)]
[1, 3, 5, 'a', 'c', 7, 2, 'b', 6]
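
On Python 3.7 and later, the same order-preserving result can be had without the side-effect trick in the comprehension, because dict keys are unique and keep insertion order. This is a minimal sketch using the same A and B as above; the name merged is only illustrative, and like the set-based versions it requires hashable elements.

```python
A = [1, 3, 5, 'a', 'c', 7]
B = [1, 2, 3, 'c', 'b', 'a', 6]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
merged = list(dict.fromkeys(A + B))
print(merged)  # [1, 3, 5, 'a', 'c', 7, 2, 'b', 6]
```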

The problem is that all elements must be hashable to use these methods:

>>> set([np.array(range(10)), 22])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'

One way around this is to use the repr of each element:

>>> set([repr(e) for e in [np.array(range(10)), 22]])
set(['22', 'array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])'])

Or use a frozenset:

>>> set(frozenset(e) for e in [np.array(range(10)), np.array(range(2))])
set([frozenset([0, 1]), frozenset([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])])

In your case, the frozenset approach does not work on a list of lists:

>>> set(frozenset(e) for e in [[np.array(range(10)), np.array(range(2))], [np.array(range(5))]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
TypeError: unhashable type: 'numpy.ndarray'

So you would need to use flattened lists.

If the repr of the sublist is definitive proof of its uniqueness, you could do this:

from collections import OrderedDict
import numpy as np

A = [['foo', 123], ['bar', np.array(range(10))], ['baz', 345]]
B = [['foo', 123], ['bar', np.array(range(10))], ['meow', 456]]

seen=OrderedDict((repr(e),0) for e in B)

newA=[]
for e in A+B:
    key=repr(e)
    if key in seen:
        if seen[key]==0:
            newA.append(e)
            seen[key]=1
    else:
        seen[key]=1
        newA.append(e)

print newA
# [['foo', 123], ['bar', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])], ['baz', 345], ['meow', 456]]

Since the repr function returns a string that can be used by the eval function to recreate the list, that is a pretty definitive test, but I cannot say so with absolute certainty. It depends on what is in your list.

For example, the repr of a lambda cannot recreate the lambda:

>>> repr(lambda x:x)
'<function <lambda> at 0x10710ec08>'

But the string value of '<function <lambda> at 0x10710ec08>' is still definitively unique, because the 0x10710ec08 part is the memory address of the lambda (in CPython, anyway).
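
A related caveat applies to NumPy arrays themselves: with the default print options, the repr of an array with more than 1000 elements is abbreviated with "...", so two different large arrays can share a repr. The sketch below illustrates this under that default setting; the array size and the changed index are arbitrary choices for the demonstration.

```python
import numpy as np

a = np.arange(2000)
b = a.copy()
b[1000] = 0  # change one value in the part that the repr elides

# Both reprs show only the first and last few elements around "..."
same_repr = repr(a) == repr(b)
same_data = np.array_equal(a, b)
print(same_repr, same_data)  # True False
```

For small arrays like the ones in the question this does not come up, but it is worth keeping in mind before relying on repr as a key.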

You could also do what I mentioned above: use a flattened list in a frozenset as a signature of what you have or have not seen:

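import collections

# Python 2 code, like the rest of this answer. On Python 3, use
# collections.abc.Iterable, str instead of basestring, and print(newA).
def flatten(LoL):
    # Recursively yield the leaves of nested iterables; strings count as leaves
    for el in LoL:
        if isinstance(el, collections.Iterable) and not isinstance(el, basestring):
            for sub in flatten(el):
                yield sub
        else:
            yield el

newA=[]
seen=set()
for e in A+B:
    # The frozenset of the flattened sublist is the "already seen" signature
    fset=frozenset(flatten(e))
    if fset not in seen:
        newA.append(e)
        seen.add(fset)

print newA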
# [['foo', 123], ['bar', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])], ['baz', 345], ['meow', 456]]
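
One thing to be aware of with a frozenset signature is that it ignores order and repeated values, so ['foo', 123] and [123, 'foo'] would count as the same sublist. If that matters, a tuple of the flattened values is a stricter signature. The following is a sketch of that variation in Python 3, not part of the approach above; new_a and sig are illustrative names, and flattening still discards nesting, so ['a', [1, 2]] and ['a', 1, 2] would collide.

```python
from collections.abc import Iterable
import numpy as np

def flatten(lol):
    """Yield the leaves of nested iterables, treating strings as leaves."""
    for el in lol:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

A = [['foo', 123], ['bar', np.array(range(10))], ['baz', 345]]
B = [['foo', 123], ['bar', np.array(range(10))], [123, 'foo']]

# A frozenset cannot tell these two sublists apart
assert frozenset(flatten(['foo', 123])) == frozenset(flatten([123, 'foo']))

new_a = []
seen = set()
for e in A + B:
    sig = tuple(flatten(e))  # sensitive to order and to repeated values
    if sig not in seen:
        seen.add(sig)
        new_a.append(e)

print(len(new_a))  # 4: the three sublists of A plus [123, 'foo']
```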

So if you have odd objects in A and B that are both unhashable and have weird, non-unique repr strings, you are out of luck. Given your example, one of these methods should work though.
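
If neither signature can be trusted, the element-by-element test the question asks about remains available as a fallback, at the cost of comparing every sublist of B against every sublist of A. This is a minimal Python 3 sketch, assuming sublists are flat and that np.array_equal is an acceptable notion of equality for the arrays; same_item and same_sublist are hypothetical helper names.

```python
import numpy as np

def same_item(x, y):
    """Compare two items; anything involving an array goes through np.array_equal."""
    if isinstance(x, np.ndarray) or isinstance(y, np.ndarray):
        return bool(np.array_equal(x, y))
    return x == y

def same_sublist(s, t):
    """Two sublists match if they have equal length and pairwise equal items."""
    return len(s) == len(t) and all(same_item(x, y) for x, y in zip(s, t))

A = [['foo', 123], ['bar', np.array(range(10))], ['baz', 345]]
B = [['foo', 123], ['bar', np.array(range(10))], ['meow', 456]]

# Quadratic in the number of sublists, but needs neither hashing nor repr
for b in B:
    if not any(same_sublist(a, b) for a in A):
        A.append(b)

print([a[0] for a in A])  # ['foo', 'bar', 'baz', 'meow']
```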

You could do

import numpy as np

A = [['foo', 123], ['bar', np.array(range(10))], ['baz', 345]]
B = [['foo', 123], ['bar', np.array(range(10))], ['meow', 456]]

# set.update() mutates in place and returns None, so the calls cannot be chained
res = set()
res.update(tuple(x) for x in A)
res.update(tuple(x) for x in B)

except for the np.array items, which are unhashable... not sure what to do with those.
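
One possible way to handle the arrays, offered as a sketch rather than a definitive fix, is to replace each array with a hashable stand-in built from its shape, dtype, and raw bytes before forming the tuple. The helper name hashable is hypothetical, the code is Python 3, and it assumes that arrays with equal values but different dtypes should be treated as different.

```python
import numpy as np

def hashable(item):
    """Return a hashable stand-in for item (arrays become a tuple of shape, dtype, bytes)."""
    if isinstance(item, np.ndarray):
        # shape and dtype are included so equal byte strings with a different layout stay distinct
        return ('ndarray', item.shape, item.dtype.str, item.tobytes())
    return item

A = [['foo', 123], ['bar', np.array(range(10))], ['baz', 345]]
B = [['foo', 123], ['bar', np.array(range(10))], ['meow', 456]]

seen = set()
merged = []
for sub in A + B:
    key = tuple(hashable(x) for x in sub)
    if key not in seen:  # keep only the first occurrence, preserving order
        seen.add(key)
        merged.append(sub)

print([sub[0] for sub in merged])  # ['foo', 'bar', 'baz', 'meow']
```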
