
How to find duplicates in a Python list of lists whose elements are numpy.ndarray of shape (9, 103)

I have a list (which I call chunks) with len(chunks) == 195 and len(chunks[0]) == 32. The elements inside chunks[0] are of type numpy.ndarray with shape (9, 103).

type(chunks[0][0])   
<class 'numpy.ndarray'>   
type(chunks[0][0][0])  
<class 'numpy.ndarray'>  
type(chunks[0][0][0][0])  
<class 'numpy.float64'>

I'm trying to find out whether there are duplicates in chunks[0]. The most appropriate way I could think of was len(chunks[0]) != len(set(chunks[0])), but that throws an error: TypeError: unhashable type.

Is there another workable way to check whether elements inside chunks[0] are equal and, if so, to eliminate the duplicates from the list? Would converting them to tensors be advisable as a fast way to check for duplicates?

The problem

Hashable data types, i.e. those that can be used as elements of sets or as keys in dicts, have to be immutable. That's because a value has to produce the same hash every time you look it up; if you could modify it, the hash would change. For example, lists and arrays can be changed in place and are therefore not hashable, but tuples are immutable, so they are hashable.
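A quick sketch of the distinction, which also reproduces the exact error the question ran into:

```python
import numpy as np

# Tuples are immutable, so they are hashable:
t = (1.0, 2.0)
print(hash(t) == hash((1.0, 2.0)))  # True

# Lists and ndarrays are mutable, so hashing them raises TypeError:
for obj in ([1.0, 2.0], np.zeros(3)):
    try:
        hash(obj)
    except TypeError as e:
        print(e)  # e.g. "unhashable type: 'list'"
```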

One possible solution

You can create a tuple containing the values from your list, array, or list of arrays, and use that in your set.

Sample code

You could use functions like these to solve your problem:

def array_2d_to_tuples(a):
    return tuple(tuple(row) for row in a)

def list_of_2d_arrays_to_tuples(a_list):
    return tuple(array_2d_to_tuples(a) for a in a_list)

These two functions return "2D" and "3D" tuples, which are hashable. You can insert their return values into sets.

Then this expression detects whether any two chunks contain the same 32 arrays in the same order:

len(chunks) != len(set(list_of_2d_arrays_to_tuples(chunk) for chunk in chunks))

Or if you want to look for duplicate arrays within chunks[0]:

len(chunks[0]) != len(set(array_2d_to_tuples(a) for a in chunks[0]))
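As a side note, a faster alternative to nested tuples (my own sketch, not required for the approach above) is to hash each array's raw bytes together with its shape: two float64 arrays of the same shape compare equal exactly when their bytes match (NaNs aside). The data below is made up to mimic the question's shapes:

```python
import numpy as np

# Made-up stand-in for chunks[0]: 32 arrays of shape (9, 103),
# with one duplicate planted at the end.
rng = np.random.default_rng(0)
chunk = [rng.random((9, 103)) for _ in range(31)]
chunk.append(chunk[0].copy())

# (shape, raw bytes) is hashable and identifies an array's contents.
keys = {(a.shape, a.tobytes()) for a in chunk}
print(len(chunk) != len(keys))  # True: there is a duplicate
```

This avoids building 9 * 103 Python float objects per array, so it scales better to large arrays.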

Eliminating the duplicates

If you want to eliminate the duplicates in the list, I would unroll that code a bit. Let chunk = chunks[0] and say you want uniq_chunk to contain the arrays from chunk without the duplicates. This code should do the trick:

found = set()
uniq_chunk = []
for a in chunk:
    as_tuple = array_2d_to_tuples(a)
    if as_tuple not in found:
        found.add(as_tuple)
        uniq_chunk.append(a)

You can adjust this approach to the exact thing you're trying to deduplicate.
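Putting it together, here is a self-contained run of the deduplication loop on made-up data (the random arrays are stand-ins for the real chunks):

```python
import numpy as np

def array_2d_to_tuples(a):
    # Convert a 2D array into a nested tuple, which is hashable.
    return tuple(tuple(row) for row in a)

# Made-up data shaped like the question: 32 arrays, one duplicated by value.
rng = np.random.default_rng(1)
chunk = [rng.random((9, 103)) for _ in range(31)]
chunk.append(chunk[5].copy())  # same content, different object

found = set()
uniq_chunk = []
for a in chunk:
    as_tuple = array_2d_to_tuples(a)
    if as_tuple not in found:
        found.add(as_tuple)
        uniq_chunk.append(a)

print(len(chunk), len(uniq_chunk))  # 32 31
```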
