Fastest way to update a list of integers based on the duplicates in another list of integers

I have two lists of integer values of the same length: a list of items and a list of labels. If an item is duplicated in the list of items, its occurrences are labeled with different integers in the list of labels. I want to assign the same integer/label (e.g. the label of the first occurrence) to all of the items that are labeled with any of those integers (note that this can affect more items than just the duplicates first found in the list of items).

Here is a minimal example of what I am doing (I converted the lists to arrays):

import numpy as np
import numba as nb
from collections import Counter

items  = np.array([7,2,0,6,0,4,1,5,2,0])
labels = np.array([1,0,3,4,2,1,6,6,5,4])

dups = [x for x, c in Counter(items).items() if c>1]

#@nb.njit(fastmath=True)
def update_labels(items, labels, dups):
    for dup in dups:
        found = np.where(items==dup)[0]
        l = labels[found]
        isin = np.where((np.isin(labels, l)))[0]
        labels[isin] = labels[isin[0]]
    return labels

new_labels = update_labels(items, labels, dups)
print(new_labels) # prints [1 0 3 3 3 1 6 6 0 3]

The code works fine for small lists. However, for larger lists like

np.random.seed(0)
n = 1_000_000
items  = np.random.randint(n, size=n)
labels = np.random.randint(int(0.8*n), size=n)

it takes forever to return the new labels. The bottleneck is the update_labels() function, which I also tried to accelerate with a numba jit decorator, but it turns out that np.isin is not supported by numba.

Is there any way to make this algorithm more efficient and/or get it to work (efficiently) with numba? Efficiency is extremely important to me since I use this with huge lists (tens of millions of elements). I am also open to using a C or C++ function and calling it from Python as a last resort. I use Python 3.x.

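import numpy as np

items = np.array([7, 2, 0, 6, 0, 4, 1, 5, 2, 0])
labels = np.array([1, 0, 3, 4, 2, 1, 6, 6, 5, 4])

d = {}  # item -> label of its first occurrence

for i in range(len(items)):
    # Reuse the label stored for this item, or store the current one.
    label = d.setdefault(items[i], labels[i])
    if label != labels[i]:
        labels[i] = label

print(labels)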
Output

[1 0 3 4 3 1 6 6 0 3]
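
Note that on the example this differs from the output printed in the question at one position: index 3 holds item 6, which is unique, but its label 4 is shared with index 9, whose item 0 is duplicated. A quick check, with both outputs copied by hand:

```python
expected = [1, 0, 3, 3, 3, 1, 6, 6, 0, 3]  # output printed in the question
got      = [1, 0, 3, 4, 3, 1, 6, 6, 0, 3]  # output of the loop above

# Positions where the two results disagree.
diff = [i for i, (e, g) in enumerate(zip(expected, got)) if e != g]
print(diff)  # [3]
```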

This one gives the same output as the original version.

def update_labels(items, labels):
    i_dict, l_dict, ranks = {}, {}, {}

    for i in range(len(items)):
        label = i_dict.setdefault(items[i], labels[i])
        if labels[i] not in ranks:
            ranks[labels[i]] = i

        if label != labels[i]:
            label1 = label
            label2 = labels[i]
            while label1 is not None and label2 is not None:
                if ranks[label1] > ranks[label2]:
                    tmp = l_dict.get(label1)
                    l_dict[label1] = label2
                    label1 = tmp
                elif ranks[label1] < ranks[label2]:
                    tmp = l_dict.get(label2)
                    l_dict[label2] = label1
                    label2 = tmp
                else:
                    break

            labels[i] = label

    for i in range(len(labels)):
        val = 0
        label = labels[i]
        while val != -1:
            val = l_dict.get(label, -1)
            if val != -1:
                label = val
        if label != labels[i]:
            labels[i] = label

    return labels
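
The linking of labels above can also be read as computing connected components over labels, where two labels are connected when they occur on the same item. Below is a minimal union-find sketch under that reading. The name update_labels_uf and its helpers are made up for illustration, it works on plain lists (or arrays) without NumPy, and I have not benchmarked it on large inputs. The root of each group is kept as the label whose first occurrence comes earliest, which should correspond to labels[isin[0]] in the question.

```python
def update_labels_uf(items, labels):
    """Union-find sketch: merge all labels that share an item."""
    parent = {}     # label -> parent label (a root points to itself)
    first_pos = {}  # label -> index of its first occurrence
    for i, lab in enumerate(labels):
        if lab not in parent:
            parent[lab] = lab
            first_pos[lab] = i

    def find(x):
        # Walk up to the root, then compress the path behind us.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    first_label = {}  # item -> label seen at its first occurrence
    for it, lab in zip(items, labels):
        other = first_label.setdefault(it, lab)
        ra, rb = find(other), find(lab)
        if ra != rb:
            # Keep the label that appears earliest as the root.
            if first_pos[ra] < first_pos[rb]:
                parent[rb] = ra
            else:
                parent[ra] = rb

    return [find(lab) for lab in labels]


items = [7, 2, 0, 6, 0, 4, 1, 5, 2, 0]
labels = [1, 0, 3, 4, 2, 1, 6, 6, 5, 4]
print(update_labels_uf(items, labels))  # [1, 0, 3, 3, 3, 1, 6, 6, 0, 3]
```

Each element is visited a constant number of times apart from the find calls, so the cost no longer grows with the number of duplicates times the length of the list.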

I feel that your code is already quite optimized. The only thing I noticed is that if you slice the dups array and apply your function update_labels to a subproblem limited to the relevant indices, you can gain more than a factor of 2 for a problem of size n=100_000 (cf. function update_labels_2). Pramote Kuacharoen's solution (cf. function update_labels_3) is much faster but does not give the same result on a big problem (I don't know whether the solution it produces is acceptable for you):

import numpy as np
import numba as nb
from collections import Counter
import time

np.random.seed(0)
n = 100_000
items  = np.random.randint(n, size=n)
labels = np.random.randint(int(0.8*n), size=n)

dups = np.array([x for x, c in Counter(items).items() if c>1])

# --------------- 1. Original solution ---------------
def update_labels(items, labels, dups):
    for dup in dups:
        found = np.where(items==dup)[0]
        l = labels[found]
        isin = np.where((np.isin(labels, l)))[0]
        labels[isin] = labels[isin[0]]
    return labels

t_start = time.time()
new_labels = update_labels(items, np.copy(labels), dups)
print('Timer 1:', time.time()-t_start, 's')

# --------------- 2. Splitting into subproblems ---------------
def update_labels_2(items, labels, dups):
    nb_slices = 20
    offsets = [int(o) for o in np.linspace(0,dups.size,nb_slices+1)]
    for i in range(nb_slices):
    #for i in range(nb_slices-1,-1,-1): # ALSO WORKS
        sub_dups = dups[offsets[i]:offsets[i+1]]
        l = labels[np.isin(items, sub_dups)]
        sub_index = np.where(np.isin(labels, l))[0]
        # Apply your function to subproblem
        labels[sub_index] = update_labels(items[sub_index], labels[sub_index], sub_dups)
    return labels

t_start = time.time()
new_labels_2 = update_labels_2(items, np.copy(labels), dups)
print('Timer 2:', time.time()-t_start, 's')

print('Results 1&2 are equal!' if np.allclose(new_labels,new_labels_2) else 'Results 1&2 differ!')

# --------------- 3. Pramote Kuacharoen solution ---------------
def update_labels_3(items, labels, dups):
    i_dict, l_dict = {}, {}
    for i in range(len(labels)):
        indices = l_dict.setdefault(labels[i], [])
        indices.append(i)
    for i in range(len(items)):
        label_values = i_dict.setdefault(items[i], [])
        if len(label_values) != 0 and labels[i] not in label_values:
            labels[i] = label_values[0]
        label_values.append(labels[i])
    for key, value in l_dict.items():
        label = ''
        sizes = []
        for v in value:
            sizes.append(len(i_dict[items[v]]))
            idx = np.argmax(sizes)
            label = labels[value[idx]]
        for v in value:
            labels[v] = label
    return labels

t_start = time.time()
new_labels_3 = update_labels_3(items, np.copy(labels), dups)
print('Timer 3:', time.time()-t_start, 's')
print('Results 1&3 are equal!' if np.allclose(new_labels,new_labels_3) else 'Results 1&3 differ!')

Output:

% python3 script.py
Timer 1: 5.082866907119751 s
Timer 2: 1.9104671478271484 s
Results 1&2 are equal!
Timer 3: 0.7601778507232666 s
Results 1&3 differ!

Unfortunately, the best speed-up I obtain is with nb_slices=20. However, there is still hope: you can verify that running the loop in reverse order in function update_labels_2 still gives the same result, so if you can prove that the subproblems are independent, you could go much faster by computing them in parallel, using mpi4py for instance.
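
Regarding whether the result of update_labels_3 is acceptable: np.allclose compares label values one by one, so two results that group the elements identically but pick different representative labels are reported as different. Here is a minimal sketch of a looser check, assuming only the grouping matters to you (same_partition is a name I made up for illustration):

```python
def same_partition(a, b):
    """Return True if a and b group positions identically, ignoring label names."""
    if len(a) != len(b):
        return False
    a_to_b, b_to_a = {}, {}
    for x, y in zip(a, b):
        # Each label in a must map to exactly one label in b, and vice versa.
        if a_to_b.setdefault(x, y) != y or b_to_a.setdefault(y, x) != x:
            return False
    return True


print(same_partition([1, 0, 3, 3], [5, 7, 2, 2]))  # True: same groups, renamed
print(same_partition([1, 0, 3, 3], [1, 0, 3, 4]))  # False: last group is split
```

If this returns False for new_labels and new_labels_3, the two solutions really do merge different sets of elements.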
