I am trying to create unique clusters of universities that are within 50 miles of each other.
I have a dictionary whose keys are tuples of two university names and whose values are the distances between them in miles:
{('University A', 'University B'): 2546,
('University A', 'University C'): 2449,
('University A', 'University D'): 5,
('University A', 'University E'): 1005,
('University B', 'University C'): 32,
('University B', 'University D'): 132,
('University B', 'University E'): 42,
('University C', 'University D'): 532,
('University C', 'University E'): 1362}
I am able to filter these to get the pairs of universities that are within 50 miles of each other:
('University A', 'University D')
('University B', 'University C')
('University B', 'University E')
How can I iterate through these pairs and create sets of clusters? What I should end up with is a set with Universities A & D and another set with Universities B, C, & E.
In reality I am looking at hundreds of universities, so the list of pairs is much longer. I am struggling to create a new set during the iteration each time a new cluster of universities appears.
This is an incomplete answer, but I hope it shows an idea that can be tested and optimized.
Convert the filtered keys to sets, then iterate over them and take the union whenever either member of a pair is already in the lookup set, which must be updated while iterating. It is easier to show with some code:
# u is the dictionary of distances from the question
filtered = [set(k) for k, v in u.items() if v <= 50]
print(filtered)  #=> [{'University A', 'University D'}, {'University B', 'University C'}, {'University B', 'University E'}]
# Seed the lookup set with one pair, then absorb every pair that shares a university with it
lookup_list = filtered[1]
for pair in filtered:
    if any(e in lookup_list for e in pair):
        lookup_list = lookup_list.union(pair)
print(lookup_list)
#=> {'University B', 'University C', 'University E'}
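The loop above grows only a single cluster, seeded from filtered[1]. One way to get every cluster in one go is a union-find (disjoint-set) structure. This is a different technique from the snippet above, offered as a minimal sketch; the names cluster_pairs, find and near_pairs are made up for illustration, and near_pairs is assumed to hold the already filtered pairs from the question.

```python
from collections import defaultdict

def cluster_pairs(pairs):
    """Group items into clusters, given (a, b) pairs that belong together."""
    parent = {}  # maps each item to its parent; a root is its own parent

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every item on the path directly at the root
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return list(groups.values())

# Hypothetical input: the pairs within 50 miles from the question
near_pairs = [
    ('University A', 'University D'),
    ('University B', 'University C'),
    ('University B', 'University E'),
]
result = cluster_pairs(near_pairs)
print(sorted(sorted(c) for c in result))
```

Because each pair is processed once and lookups are close to constant time, this should stay fast with hundreds of universities, and the order of the pairs does not matter.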
With helpful guidance from @Daniel Mesejo & @Jon Clements and from this post, I ended up using networkx to solve the problem.
Starting from a list of tuples, clusters, looking like [('University A', 'University B'), ('University A', 'University C'), ...], I created the graph with:
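import matplotlib.pyplot as plt
import networkx as nx

g = nx.Graph()
for c in clusters:
    g.add_edge(*c)  # each tuple becomes an edge between two universities
nx.draw(g)
plt.show()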
Then, to extract the clusters and give each a unique identifier, I built a dictionary in which each key is a cluster number and each value is a list of the nodes (school names) in that cluster:
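# nx.connected_component_subgraphs has been removed from newer networkx releases;
# nx.connected_components yields the set of nodes in each connected component
components = list(nx.connected_components(g))
n = len(components)
clusters = {}  # note: this reuses the name of the original list of tuples
for i in range(n):
    clusters[i+1] = list(components[i])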
And finally, to map them back onto the original dataframe:
def map_cluster(x):
    for k, v in clusters.items():
        if x in v:
            return k

df['Cluster'] = df['School Name'].apply(map_cluster)
I am certain there is a more efficient way to do this and would welcome comments on this approach!
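One possible tightening, as a sketch: map_cluster scans every cluster for every row, whereas inverting the dictionary once gives a direct school-to-cluster lookup. The name school_to_cluster is hypothetical, and the clusters dictionary below is a small stand-in with the shape described above.

```python
# Stand-in for the clusters dictionary built above: cluster number -> list of school names
clusters = {
    1: ['University A', 'University D'],
    2: ['University B', 'University C', 'University E'],
}

# Invert it once: school name -> cluster number
school_to_cluster = {
    school: cluster_id
    for cluster_id, schools in clusters.items()
    for school in schools
}

print(school_to_cluster['University E'])

# With pandas, the column could then be filled without a Python-level loop per row:
# df['Cluster'] = df['School Name'].map(school_to_cluster)
```

Schools that are not in any cluster are simply missing from the inverted dictionary, so Series.map would leave NaN for them, where map_cluster returned None.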