I'm trying to run calculations on a large network object to predict links appearing between nodes. I can do this in serial, but not in parallel using Python's multiprocessing: the parallel implementation never seems to return, and looking at my task manager it does not appear to use much memory or CPU either.
import networkx as nx
import multiprocessing as mp

def jaccard_serial_predictions(G):
    """
    Create a ranked list of possible new links based on the Jaccard similarity,
    defined as the size of the intersection of the two neighbour sets divided
    by the size of their union.
    parameters
        G: directed or undirected nx graph
    returns
        list of link bunches with the score as an attribute
    """
    potential_edges = []
    G_undirected = nx.Graph(G)
    for non_edge in nx.non_edges(G_undirected):
        u = set(G.neighbors(non_edge[0]))
        v = set(G.neighbors(non_edge[1]))
        uv_un = len(u.union(v))
        uv_int = len(u.intersection(v))
        if uv_int == 0 or uv_un == 0:
            continue
        else:
            s = (1.0 * uv_int) / uv_un
            potential_edges.append(non_edge + ({'score': s},))
    return potential_edges
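As a sanity check of the scoring logic itself, here is the same Jaccard computation on a plain adjacency dict (a minimal sketch independent of networkx; the names are illustrative):

```python
def jaccard_score(adj, a, b):
    # adj maps node -> set of neighbours, standing in for G.neighbors
    u, v = adj[a], adj[b]
    inter, union = len(u & v), len(u | v)
    if inter == 0 or union == 0:       # same guard as the loop above
        return None
    return inter / union

adj = {0: {1}, 1: {0, 2}, 2: {1}}      # path 0-1-2; (0, 2) is the only non-edge
print(jaccard_score(adj, 0, 2))        # nodes 0 and 2 share neighbour 1 -> 1.0
```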
def jaccard_prediction(non_edge):
    # NB: relies on G being a module-level global visible to the worker processes
    u = set(G.neighbors(non_edge[0]))
    v = set(G.neighbors(non_edge[1]))
    uv_un = len(u.union(v))
    uv_int = len(u.intersection(v))
    if uv_int == 0 or uv_un == 0:
        return
    else:
        s = (1.0 * uv_int) / uv_un
        return non_edge + ({'score': s},)
def jaccard_mp_predictions(G):
    """
    Create a ranked list of possible new links based on the Jaccard similarity,
    defined as the size of the intersection of the two neighbour sets divided
    by the size of their union.
    parameters
        G: directed or undirected nx graph
    returns
        list of link bunches with the score as an attribute
    """
    pool = mp.Pool(processes=4)
    G_undirected = nx.Graph(G)
    results = pool.map(jaccard_prediction, nx.non_edges(G_undirected))
    return results
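For reference, a minimal self-contained analogue of this pool.map setup, with a plain adjacency dict standing in for the graph (all names here are illustrative, and the worker reads a module-level global just like jaccard_prediction does):

```python
import multiprocessing as mp

ADJ = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}   # toy adjacency, stand-in for G

def jaccard_pair(pair):
    # worker: reads the module-level ADJ, mirroring how jaccard_prediction reads G
    a, b = pair
    u, v = ADJ[a], ADJ[b]
    inter, union = len(u & v), len(u | v)
    if inter == 0 or union == 0:
        return None
    return pair + ({'score': inter / union},)

if __name__ == '__main__':
    candidates = [(0, 2), (0, 3)]
    with mp.Pool(processes=2) as pool:
        print(pool.map(jaccard_pair, candidates))   # note the None for (0, 3)
```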
Calling jaccard_serial_predictions(G) with G being a graph with 95,000,000 potential edges returns in 4.5 minutes, but jaccard_mp_predictions(G) does not return even after running for half an hour.
I'm not sure about this, but I think I've spotted a potential slowdown. Compare the code for the serial operation on each non-edge:
u = set(G.neighbors(non_edge[0]))
v = set(G.neighbors(non_edge[1]))
uv_un = len(u.union(v))
uv_int = len(u.intersection(v))
if uv_int == 0 or uv_un == 0:
    continue
else:
    s = (1.0*uv_int)/uv_un
    potential_edges.append(non_edge + ({'score': s},))
with that for the parallel operation:
u = set(G.neighbors(non_edge[0]))
v = set(G.neighbors(non_edge[1]))
uv_un = len(u.union(v))
uv_int = len(u.intersection(v))
if uv_int == 0 or uv_un == 0:
    return
else:
    s = (1.0*uv_int)/uv_un
    return non_edge + ({'score': s},)
In the serial version, whenever the condition uv_int == 0 or uv_un == 0 is true, you skip adding to the list. In the parallelized version, however, you return None. The mapping operation isn't smart enough to leave None out of the result list, while the serial loop simply skips over those elements. That means one extra appended result for every non-scoreable element in the parallel version, and if you have a lot of those, the slowdown could be substantial!
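If that's the issue, the cheap fix is to filter the Nones out of the mapped results afterwards, the same way the serial loop skips them (a sketch with a toy scorer; the builtin map stands in for pool.map here):

```python
def score(pair):
    # toy stand-in for jaccard_prediction: returns None for non-scoreable pairs
    a, b = pair
    return pair + ({'score': a / (a + b)},) if a + b else None

pairs = [(1, 2), (0, 0), (3, 1)]
raw = list(map(score, pairs))                  # keeps the Nones
results = [r for r in raw if r is not None]    # drops them, like `continue` does
print(len(raw), len(results))                  # 3 2
```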