I have two lists of integer values of the same length: a list of items and a list of labels. If an item is duplicated in the list of items, its occurrences are labeled using different integers in the list of labels. I want to assign the same integer/label (e.g. the label of the first occurrence) to all of the items that are labeled with those integers (note that this can be more than just the duplicates we first found in the list of items).
Here is a minimal example of what I am doing (I converted the lists to arrays):
import numpy as np
import numba as nb
from collections import Counter
items = np.array([7,2,0,6,0,4,1,5,2,0])
labels = np.array([1,0,3,4,2,1,6,6,5,4])
dups = [x for x, c in Counter(items).items() if c>1]
#@nb.njit(fastmath=True)
def update_labels(items, labels, dups):
    for dup in dups:
        found = np.where(items==dup)[0]
        l = labels[found]
        isin = np.where((np.isin(labels, l)))[0]
        labels[isin] = labels[isin[0]]
    return labels
new_labels = update_labels(items, labels, dups)
print(new_labels) # prints [1 0 3 3 3 1 6 6 0 3]
The code works fine for small lists. However, for larger lists like
np.random.seed(0)
n = 1_000_000
items = np.random.randint(n, size=n)
labels = np.random.randint(int(0.8*n), size=n)
it takes forever to return the new labels. The bottleneck is in the update_labels() function, which I also tried to accelerate with a numba jit decorator, but it turns out that np.isin is not supported by numba.
Is there any way to make this algorithm more efficient and/or get it to work (efficiently) with numba? Code efficiency is extremely important to me since I use this with huge lists (tens of millions of elements). I am also open to using a C or C++ function and calling it from Python as a last resort. I use Python 3.x.
items = np.array([7, 2, 0, 6, 0, 4, 1, 5, 2, 0])
labels = np.array([1, 0, 3, 4, 2, 1, 6, 6, 5, 4])
d = {}
for i in range(len(items)):
    label = d.setdefault(items[i], labels[i])
    if label != labels[i]:
        labels[i] = label
Output
[1 0 3 4 3 1 6 6 0 3]
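As a quick check against the example from the question: this single pass leaves index 3 with label 4, whereas the original code relabels it to 3. The loop copies the first label seen for each item but does not merge whole label groups. A small sketch of that comparison follows; it uses plain lists instead of NumPy arrays, and the helper name `first_label_per_item` is hypothetical.

```python
def first_label_per_item(items, labels):
    """Single pass: give every item the label of its first occurrence."""
    labels = list(labels)  # work on a copy, leave the input untouched
    first = {}
    for i, item in enumerate(items):
        # setdefault stores the label on first sight and returns it afterwards
        labels[i] = first.setdefault(item, labels[i])
    return labels


items = [7, 2, 0, 6, 0, 4, 1, 5, 2, 0]
labels = [1, 0, 3, 4, 2, 1, 6, 6, 5, 4]
expected = [1, 0, 3, 3, 3, 1, 6, 6, 0, 3]  # output printed in the question

result = first_label_per_item(items, labels)
differing = [i for i in range(len(items)) if result[i] != expected[i]]
print(result)     # [1, 0, 3, 4, 3, 1, 6, 6, 0, 3]
print(differing)  # [3]
```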
The following version gives the same output as the original code:
def update_labels(items, labels):
    i_dict, l_dict, ranks = {}, {}, {}
    for i in range(len(items)):
        label = i_dict.setdefault(items[i], labels[i])
        if labels[i] not in ranks:
            ranks[labels[i]] = i
        if label != labels[i]:
            label1 = label
            label2 = labels[i]
            while label1 is not None and label2 is not None:
                if ranks[label1] > ranks[label2]:
                    tmp = l_dict.get(label1)
                    l_dict[label1] = label2
                    label1 = tmp
                elif ranks[label1] < ranks[label2]:
                    tmp = l_dict.get(label2)
                    l_dict[label2] = label1
                    label2 = tmp
                else:
                    break
            labels[i] = label
    for i in range(len(labels)):
        val = 0
        label = labels[i]
        while val != -1:
            val = l_dict.get(label, -1)
            if val != -1:
                label = val
        if label != labels[i]:
            labels[i] = label
    return labels
I feel that your code is quite optimized already. The only thing I noticed is that if you slice the dups array and apply your function update_labels to a subproblem limited to the concerned indexes, you can gain more than a factor of 2 for a problem of size n=100_000 (cf. function update_labels_2). Pramote Kuacharoen's solution (cf. function update_labels_3) is way faster but doesn't give the correct solution on a big problem (I don't know whether the solution it produces is acceptable for you):
import numpy as np
import numba as nb
from collections import Counter
import time
np.random.seed(0)
n = 100_000
items = np.random.randint(n, size=n)
labels = np.random.randint(int(0.8*n), size=n)
dups = np.array([x for x, c in Counter(items).items() if c>1])
# --------------- 1. Original solution ---------------
def update_labels(items, labels, dups):
    for dup in dups:
        found = np.where(items==dup)[0]
        l = labels[found]
        isin = np.where((np.isin(labels, l)))[0]
        labels[isin] = labels[isin[0]]
    return labels
t_start = time.time()
new_labels = update_labels(items, np.copy(labels), dups)
print('Timer 1:', time.time()-t_start, 's')
# --------------- 2. Splitting into subproblems ---------------
def update_labels_2(items, labels, dups):
    nb_slices = 20
    offsets = [int(o) for o in np.linspace(0,dups.size,nb_slices+1)]
    for i in range(nb_slices):
    #for i in range(nb_slices-1,-1,-1): # ALSO WORKS
        sub_dups = dups[offsets[i]:offsets[i+1]]
        l = labels[np.isin(items, sub_dups)]
        sub_index = np.where(np.isin(labels, l))[0]
        # Apply your function to subproblem
        labels[sub_index] = update_labels(items[sub_index], labels[sub_index], sub_dups)
    return labels
t_start = time.time()
new_labels_2 = update_labels_2(items, np.copy(labels), dups)
print('Timer 2:', time.time()-t_start, 's')
print('Results 1&2 are equal!' if np.allclose(new_labels,new_labels_2) else 'Results 1&2 differ!')
# --------------- 3. Pramote Kuacharoen solution ---------------
def update_labels_3(items, labels, dups):
    i_dict, l_dict = {}, {}
    for i in range(len(labels)):
        indices = l_dict.setdefault(labels[i], [])
        indices.append(i)
    for i in range(len(items)):
        label_values = i_dict.setdefault(items[i], [])
        if len(label_values) != 0 and labels[i] not in label_values:
            labels[i] = label_values[0]
        label_values.append(labels[i])
    for key, value in l_dict.items():
        label = ''
        sizes = []
        for v in value:
            sizes.append(len(i_dict[items[v]]))
        idx = np.argmax(sizes)
        label = labels[value[idx]]
        for v in value:
            labels[v] = label
    return labels
t_start = time.time()
new_labels_3 = update_labels_3(items, np.copy(labels), dups)
print('Timer 3:', time.time()-t_start, 's')
print('Results 1&3 are equal!' if np.allclose(new_labels,new_labels_3) else 'Results 1&3 differ!')
Output:
% python3 script.py
Timer 1: 5.082866907119751 s
Timer 2: 1.9104671478271484 s
Results 1&2 are equal!
Timer 3: 0.7601778507232666 s
Results 1&3 differ!
Unfortunately, the best speed-up I obtain is with nb_slices=20. However, there is still hope: you can verify that running the loop in reverse order in function update_labels_2 still gives the same result. So, if you can prove that the subproblems are independent, you can go very fast by computing the subproblems in parallel, using mpi4py for instance.
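Another direction, offered only as a sketch: the relabeling in the question amounts to grouping positions that are linked by a shared item or a shared label, and giving each group the label found at its earliest position. A union-find (disjoint-set) structure can do that in one pass over the data, without repeated np.isin scans. The function name `update_labels_uf` and its helpers are hypothetical, the code uses plain Python lists rather than NumPy arrays, and it has not been benchmarked or tried under numba.

```python
def update_labels_uf(items, labels):
    """Union-find sketch: merge positions linked by a shared item or label."""
    n = len(items)
    parent = list(range(n))  # parent[k] is the union-find parent of position k

    def find(k):
        # Follow parents up to the root, shortening the path as we go
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        # Keep the smaller index as root, so each root is the
        # earliest position of its group
        if ra < rb:
            parent[rb] = ra
        else:
            parent[ra] = rb

    first_by_item = {}   # item value  -> first position where it was seen
    first_by_label = {}  # label value -> first position where it was seen
    for i in range(n):
        union(i, first_by_item.setdefault(items[i], i))
        union(i, first_by_label.setdefault(labels[i], i))

    # Every position takes the original label of its group's earliest position
    return [labels[find(i)] for i in range(n)]


items = [7, 2, 0, 6, 0, 4, 1, 5, 2, 0]
labels = [1, 0, 3, 4, 2, 1, 6, 6, 5, 4]
print(update_labels_uf(items, labels))  # [1, 0, 3, 3, 3, 1, 6, 6, 0, 3]
```

On the small example from the question this reproduces the output of the original update_labels. Whether it matches on the large random inputs is something to verify against Timer 1 before relying on it.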