List of tuples: Select only tuples based on common element either at first or second position

Question

I have a huge list of tuples :

ijs = [(0,1),(0,2),(0,3), (3,2)...]

For a given value v , I want to get only the (i,j) pairs that have either i=v or j=v (from all possible (i,j) pairs stored in ijs ).

Eg for v=0 and given ijs = [(0,1),(0,2),(0,3), (3,2)] , then I should get back only_current = [(0,1),(0,2),(0,3)]

Example:

Please ignore the first 3 lines where I built a list ijs having tuples inside.

import numpy as np

# IGNORE THIS PART UNTIL THE MAIN LOOP 
N= 1000
indices_upper_triangle = np.triu_indices(N, k = 1) # k=1 above diagonal
i,j = indices_upper_triangle
ijs = [(i,j) for i,j in zip(i,j)] # create (i,j) positions

# MAIN LOOP HERE 
# Main loop
all_neig_results = list()
for v in range(N): # for each point

    # from all possible (i,j) pairs, get only (i,j) pairs that have either i=v or j=v
    only_current = [item for item in ijs if v in item]

    all_neig_results.append(only_current)

The list comprehension in the loop is super slow.

%timeit [item for item in ijs if v in item]
15.9 s ± 361 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If I remove the check argument if v in item :

%timeit [item for item in ijs]
1.28 s ± 90.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How can I optimize [item for item in ijs if v in item] ?

Answer 1

You are doing O(N^2) work here, because you go over the whole list each time in the inner loop.

You should precompute once:

import collections
def precompute(pairs):
    d = collections.defaultdict(list)
    for ij in pairs:
        d[ij[0]].append(ij)
        d[ij[1]].append(ij)
    return d

d = precompute(ijs)

to make the work in the inner loop fast:

for v in range(N):
    only_current = d[v]

this way you are doing only an O(N) work.

Answer 2

If your goal is to find nearest neighbors, I'd recommend using a KDTree , which

can be used to rapidly look up the nearest neighbors of any point.

and avoids having to loop over all pairs of points / indexes.

See this example: How to find set of points in x,y grid using KDTree.query_ball_tree

Concerning the leafsize parameter: KDTrees "split up" space into different regions. Each leaf corresponds to a region. if leafsize =1 then it continues to make splits until each region contains only one point. This means building the tree will take a long time, but searches will be fast. If the leaves contain more points, the tree is shallower so building it will be faster. But queries will probably be slower. the "brute-force" bit in the doc refers to the fact that if you have more than 1 point in each region, you need to brute force to figure out nearest neighbors among those in the region.

leafsize=10 will mean the leaf has 10 or less. Each split is such that child nodes have (approximately) half of the points in the parent node, so the precise number of points in each leaf will be variable. there are other implementation details that could affect this.

List of tuples: Select only tuples based on common element either at first or second position

Question

2 answers

solution1
2 ACCPTED 2020-06-18 15:16:26

solution2
1 2020-06-18 15:00:44

List of tuples: Select only tuples based on common element either at first or second position

Question

2 answers

solution1 2 ACCPTED 2020-06-18 15:16:26

solution2 1 2020-06-18 15:00:44

solution1
2 ACCPTED 2020-06-18 15:16:26

solution2
1 2020-06-18 15:00:44