
How to find duplicates in a Python list of lists whose elements are numpy.ndarray of shape (9, 103)

I have a list (which I call chunks) with len(chunks) == 195 and len(chunks[0]) == 32. The elements inside chunks[0] are of type numpy.ndarray with shape (9, 103).

type(chunks[0][0])   
<class 'numpy.ndarray'>   
type(chunks[0][0][0])  
<class 'numpy.ndarray'>  
type(chunks[0][0][0][0])  
<class 'numpy.float64'>

I'm trying to find out whether there are duplicates in chunks[0]. The most appropriate way I could think of was len(chunks[0]) != len(set(chunks[0])), but that throws an error: TypeError: unhashable type.

Is there another workable way to check whether elements inside chunks[0] are equal and, if so, to eliminate the duplicates from the list? Would converting them to tensors be advisable as a fast way to check for duplicates?

The problem

Hashable data types, i.e. those that can be used as elements of sets or as keys in dicts, have to be immutable. That's because a value has to produce the same hash every time you look it up; if you could modify it, the hash would change. For example, lists and arrays can be changed in place and are therefore not hashable, but tuples are immutable, so they are hashable.
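A quick sketch of the distinction, which also reproduces the exact error the question ran into:

```python
import numpy as np

# Tuples are immutable, so they are hashable:
t = (1.0, 2.0)
print(hash(t) == hash((1.0, 2.0)))  # True

# Lists and ndarrays are mutable, so hashing them raises TypeError:
for obj in ([1.0, 2.0], np.zeros(3)):
    try:
        hash(obj)
    except TypeError as e:
        print(e)  # e.g. "unhashable type: 'list'"
```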

One possible solution

You can create a tuple containing the values from your list, array, or list of arrays, and use that in your set.

Sample code

You could use functions like these to solve your problem:

def array_2d_to_tuples(a):
    return tuple(tuple(row) for row in a)

def list_of_2d_arrays_to_tuples(a_list):
    return tuple(array_2d_to_tuples(a) for a in a_list)

These two functions return "2D" and "3D" tuples, which are hashable. You can insert their return values into sets.

Then this expression detects whether any two chunks contain the same 32 arrays in the same order:

len(chunks) != len(set(list_of_2d_arrays_to_tuples(chunk) for chunk in chunks))

Or if you want to look for duplicate arrays within chunks[0]:

len(chunks[0]) != len(set(array_2d_to_tuples(a) for a in chunks[0]))
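As a side note, a faster alternative to nested tuples (my own sketch, not required for the approach above) is to hash each array's raw bytes together with its shape: two float64 arrays of the same shape compare equal exactly when their bytes match (NaNs aside). The data below is made up to mimic the question's shapes:

```python
import numpy as np

# Made-up stand-in for chunks[0]: 32 arrays of shape (9, 103),
# with one duplicate planted at the end.
rng = np.random.default_rng(0)
chunk = [rng.random((9, 103)) for _ in range(31)]
chunk.append(chunk[0].copy())

# (shape, raw bytes) is hashable and identifies an array's contents.
keys = {(a.shape, a.tobytes()) for a in chunk}
print(len(chunk) != len(keys))  # True: there is a duplicate
```

This avoids building 9 * 103 Python float objects per array, so it scales better to large arrays.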

Eliminating the duplicates

If you want to eliminate the duplicates in the list, I would unroll that code a bit. Let chunk = chunks[0] and say you want uniq_chunk to contain the arrays from chunk without the duplicates. This code should do the trick:

found = set()
uniq_chunk = []
for a in chunk:
    as_tuple = array_2d_to_tuples(a)
    if as_tuple not in found:
        found.add(as_tuple)
        uniq_chunk.append(a)

You can adjust this approach to the exact thing you're trying to deduplicate.
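Putting it together, here is a self-contained run of the deduplication loop on made-up data (the random arrays are stand-ins for the real chunks):

```python
import numpy as np

def array_2d_to_tuples(a):
    # Convert a 2D array into a nested tuple, which is hashable.
    return tuple(tuple(row) for row in a)

# Made-up data shaped like the question: 32 arrays, one duplicated by value.
rng = np.random.default_rng(1)
chunk = [rng.random((9, 103)) for _ in range(31)]
chunk.append(chunk[5].copy())  # same content, different object

found = set()
uniq_chunk = []
for a in chunk:
    as_tuple = array_2d_to_tuples(a)
    if as_tuple not in found:
        found.add(as_tuple)
        uniq_chunk.append(a)

print(len(chunk), len(uniq_chunk))  # 32 31
```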
