
How to cluster strings by Hamming or Levenshtein distance

As an exercise, I would like to cluster a set of English words by Hamming or Levenshtein distance. For Hamming distance the words must all be the same length (or be padded to the same length), but this restriction does not apply to Levenshtein distance.

I normally use scikit-learn, which has many clustering algorithms, but none of them seems to accept arrays of categorical variables, which is the most obvious way to represent a string.

I could precompute a massive distance matrix, but this is unrealistic if the number of strings is at all large.

How can you cluster strings efficiently?

This seems relevant.

https://towardsdatascience.com/applying-machine-learning-to-classify-an-unsupervised-text-document-e7bb6265f52

This seems relevant too.

https://pythonprogramminglanguage.com/kmeans-text-clustering/

This example uses Affinity Propagation.

# Requires the third-party "distance" package: pip install distance
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance

words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ")  # Replace this line with your own words
words = np.asarray(words)  # so that indexing with a list will work
# Affinity Propagation expects similarities, so negate the distances
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

# Result
 - *squooshy:* squooshy
 - *feedback:* feedback
 - *extension:* extension
 - *impressed:* impressed
 - *google:* google
 - *eating:* climbing, eating
 - *face:* face, map
 - *key:* belly, best, key, kitten, merley
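
If you would rather avoid the third-party `distance` package, the same idea works with a small pure-Python Levenshtein function plus scikit-learn's DBSCAN on a precomputed distance matrix. This is a minimal sketch, not from the answer above; the word list is shortened and the `eps=3` threshold is an illustrative choice you would tune for your data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

words = "kitten belly squooshy merley best eating google feedback".split()
dist = np.array([[levenshtein(w1, w2) for w2 in words] for w1 in words])

# eps is the maximum edit distance between neighbours within a cluster;
# min_samples=1 means every word ends up in some cluster (no noise label).
labels = DBSCAN(eps=3, min_samples=1, metric="precomputed").fit_predict(dist)
for label in sorted(set(labels)):
    print(label, [w for w, l in zip(words, labels) if l == label])
```

DBSCAN only ever reads the precomputed matrix, so nothing here requires the strings to have equal lengths.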

Finally, I've been in the data science space for at least 8 years, and while I've heard of using Levenshtein distance to compute similarity, I haven't seen it used for clustering. Combining cosine similarity with clustering seems to make sense. Hopefully someone posts a solution here about that very thing.
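
Along those lines, here is one hedged sketch: represent each word by TF-IDF over character n-grams (so cosine similarity captures shared substrings), then cluster with KMeans. Since scikit-learn's TF-IDF vectors are L2-normalised by default, Euclidean KMeans on them approximates clustering by cosine similarity. The word list, `ngram_range`, and `n_clusters=3` are illustrative assumptions, not from the original post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

words = ["kitten", "kitchen", "mitten", "google", "googol", "feedback"]

# Character n-grams ("ki", "itt", ...) let cosine similarity see
# shared substrings between words.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vectorizer.fit_transform(words)

# TF-IDF rows are unit-length, so Euclidean KMeans here behaves
# like clustering by cosine similarity.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for cluster_id in range(3):
    print(cluster_id, [w for w, l in zip(words, km.labels_) if l == cluster_id])
```

Unlike the pairwise-distance approaches above, this scales linearly in the number of words, since no full distance matrix is built.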
