I'm trying to run calculations on a large network object to predict links appearing between nodes. I can do this in serial, but not in parallel using Python's multiprocessing: the parallel implementation never seems to return, and looking at my task manager it does not appear to use much memory or CPU either.
import networkx as nx
import multiprocessing as mp

def jaccard_serial_predictions(G):
    """
    Create a ranked list of possible new links based on the Jaccard similarity,
    defined as the size of the intersection of the two neighbour sets divided
    by the size of their union.
    parameters
        G: directed or undirected nx graph
    returns
        list of link bunches with the score as an attribute
    """
    potential_edges = []
    G_undirected = nx.Graph(G)
    for non_edge in nx.non_edges(G_undirected):
        u = set(G.neighbors(non_edge[0]))
        v = set(G.neighbors(non_edge[1]))
        uv_un = len(u.union(v))
        uv_int = len(u.intersection(v))
        if uv_int == 0 or uv_un == 0:
            continue
        else:
            s = (1.0 * uv_int) / uv_un
            potential_edges.append(non_edge + ({'score': s},))
    return potential_edges
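As a sanity check of the scoring logic itself, here is the same Jaccard computation on a plain adjacency dict (a minimal sketch independent of networkx; the names are illustrative):

```python
def jaccard_score(adj, a, b):
    # adj maps node -> set of neighbours, standing in for G.neighbors
    u, v = adj[a], adj[b]
    inter, union = len(u & v), len(u | v)
    if inter == 0 or union == 0:       # same guard as the loop above
        return None
    return inter / union

adj = {0: {1}, 1: {0, 2}, 2: {1}}      # path 0-1-2; (0, 2) is the only non-edge
print(jaccard_score(adj, 0, 2))        # nodes 0 and 2 share neighbour 1 -> 1.0
```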
def jaccard_prediction(non_edge):
    # NB: relies on G being a module-level global visible to the worker processes
    u = set(G.neighbors(non_edge[0]))
    v = set(G.neighbors(non_edge[1]))
    uv_un = len(u.union(v))
    uv_int = len(u.intersection(v))
    if uv_int == 0 or uv_un == 0:
        return
    else:
        s = (1.0 * uv_int) / uv_un
        return non_edge + ({'score': s},)
def jaccard_mp_predictions(G):
    """
    Create a ranked list of possible new links based on the Jaccard similarity,
    defined as the size of the intersection of the two neighbour sets divided
    by the size of their union.
    parameters
        G: directed or undirected nx graph
    returns
        list of link bunches with the score as an attribute
    """
    pool = mp.Pool(processes=4)
    G_undirected = nx.Graph(G)
    results = pool.map(jaccard_prediction, nx.non_edges(G_undirected))
    return results
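For reference, a minimal self-contained analogue of this pool.map setup, with a plain adjacency dict standing in for the graph (all names here are illustrative, and the worker reads a module-level global just like jaccard_prediction does):

```python
import multiprocessing as mp

ADJ = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}   # toy adjacency, stand-in for G

def jaccard_pair(pair):
    # worker: reads the module-level ADJ, mirroring how jaccard_prediction reads G
    a, b = pair
    u, v = ADJ[a], ADJ[b]
    inter, union = len(u & v), len(u | v)
    if inter == 0 or union == 0:
        return None
    return pair + ({'score': inter / union},)

if __name__ == '__main__':
    candidates = [(0, 2), (0, 3)]
    with mp.Pool(processes=2) as pool:
        print(pool.map(jaccard_pair, candidates))   # note the None for (0, 3)
```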
Calling jaccard_serial_predictions(G) with G being a graph with 95,000,000 potential edges returns in 4.5 minutes, but jaccard_mp_predictions(G) does not return even after running for half an hour.
I'm not sure about this, but I think I've spotted a potential slowdown. Compare the code for the serial operation on each non-edge:
u = set(G.neighbors(non_edge[0]))
v = set(G.neighbors(non_edge[1]))
uv_un = len(u.union(v))
uv_int = len(u.intersection(v))
if uv_int == 0 or uv_un == 0:
    continue
else:
    s = (1.0*uv_int)/uv_un
    potential_edges.append(non_edge + ({'score': s},))
with that for the parallel operation:
u = set(G.neighbors(non_edge[0]))
v = set(G.neighbors(non_edge[1]))
uv_un = len(u.union(v))
uv_int = len(u.intersection(v))
if uv_int == 0 or uv_un == 0:
    return
else:
    s = (1.0*uv_int)/uv_un
    return non_edge + ({'score': s},)
In the serial version, whenever the condition uv_int == 0 or uv_un == 0 is true, you skip adding to the list. In the parallelized version, however, you return None. The mapping operation isn't smart enough to leave None out of the result list, while the serial loop simply skips over those elements. That means one extra appended result for every non-scoreable element in the parallel version, and if you have a lot of those, the slowdown could be substantial!
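If that's the issue, the cheap fix is to filter the Nones out of the mapped results afterwards, the same way the serial loop skips them (a sketch with a toy scorer; the builtin map stands in for pool.map here):

```python
def score(pair):
    # toy stand-in for jaccard_prediction: returns None for non-scoreable pairs
    a, b = pair
    return pair + ({'score': a / (a + b)},) if a + b else None

pairs = [(1, 2), (0, 0), (3, 1)]
raw = list(map(score, pairs))                  # keeps the Nones
results = [r for r in raw if r is not None]    # drops them, like `continue` does
print(len(raw), len(results))                  # 3 2
```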