
Cluster URLs based on their pattern using Python

I am new to clustering techniques and would highly value any input on the problem below. Basically, I want to cluster URLs based on their structural patterns. For example:

  • cluster1 - simple URLs https://domain/path/file
  • cluster2 - shortened URLs
  • cluster3 - redirect URLs
  • ....
  • cluster k - new URL pattern

Given a URL dataset, I want to find out how many different URL pattern clusters exist and then see the differences visually.

The existing methods I have seen cluster domain-wise (URLs of the same website end up together), which is not what I want. The same thing happens when I try NLP-based (word-based) similarity clustering, because URLs of the same website tend to share the same words with only small differences.

Instead, I want to focus on the URL structure and identify URL patterns. Removing all the special characters and just creating a bag of words for each URL nullifies the URL structure. Can anyone help me identify a suitable clustering technique, as well as a vectorizing technique, for finding different URL pattern clusters?

Thanks in advance, Matheesha

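Here is an example of how to cluster text using Levenshtein (edit) distance and affinity propagation. Replace the word list with your own strings.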

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance  # third-party package that provides levenshtein()

words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ")  # Replace this line
words = np.asarray(words)  # So that indexing with a list will work
# Similarity is the negated edit distance, so closer strings score higher
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)  # Fit on the precomputed similarity matrix
for cluster_id in np.unique(affprop.labels_):
    # The exemplar is the most representative member of the cluster
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))


 - *eating:* climbing, eating
 - *google:* google, squooshy
 - *feedback:* feedback
 - *face:* face, map
 - *impressed:* impressed
 - *extension:* extension
 - *key:* belly, best, key, kitten, merley

