
What algorithm for discovering sequences of similar URLs?

Let's say a domain has a list of URLs with varying levels of path depth and similarity:

url1/some/where/here
url1/some/where-2/here
url1/some-3/where/here
...
...
url1/some/where/here/right/now/1
url1/some/where/here/right/now/2
url1/some/where/here/right/now/3
url1/some/where/here/right-1/now/1
url1/some/where/here/right-1/now/2
url1/some/where/here/right-1/now/3
url1/some/where/here/right-2/now/1
url1/some/where/here/right-2/now/2
url1/some/where/here/right-2/now/3
url1/some/where/here/right-2/now/4
...

What algorithm can I use to cluster URL strings based on their density (number of slashes) and similarity (text distance, Levenshtein)?

So the output would be clustered into groups:

url1/some/where/here

url1/some/where-2/here

url1/some-3/where/here

url1/some/where/here/right/now/1
url1/some/where/here/right/now/2
url1/some/where/here/right/now/3

url1/some/where/here/right-1/now/1
url1/some/where/here/right-1/now/2
url1/some/where/here/right-1/now/3

url1/some/where/here/right-2/now/1
url1/some/where/here/right-2/now/2
url1/some/where/here/right-2/now/3
url1/some/where/here/right-2/now/4

url1/some-3/where/here/133

Some characteristics:

- The more dense (or deeper) a URL string is, the more relevant it is and the more likely it is to repeat in sequences.
- Similar chunks of URLs repeat one after another; dissimilar URLs tend to sit further away from the chunk of similar URLs.

Is DBSCAN appropriate here, using (density, Levenshtein distance) as the features?
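
To make the "(density, Levenshtein distance)" idea concrete, here is a rough sketch of one possible pairwise distance; the depth weight of 0.5 is an arbitrary assumption of mine, not anything established.

    # Sketch of a combined pairwise distance between two URLs:
    # depth = number of slashes, plus plain Levenshtein distance on the full strings.
    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def url_distance(u: str, v: str, depth_weight: float = 0.5) -> float:
        # depth_weight is a made-up knob; tune it against real data
        depth_diff = abs(u.count("/") - v.count("/"))
        return levenshtein(u, v) + depth_weight * depth_diff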

I thought of erasing the last characters up to the last slash, and then searching for matches in the subsequent strings. If the match is the next URL in the list, they are likely part of a chunk. If the match is only found further down the list, it is likely not part of any chunk.

    url1/some/where
This prefix is found almost everywhere, and is thus not part of any chunk.

    url1/some/where/here/right/now/
finds 2 subsequent matches, occurring immediately after the candidate.

    url1/some/where/here/right-2/now/
finds 3 subsequently occurring matches, so they are chunked together.

    url1/some-3/where
finds only one other match at the very bottom of the list; because of that distance, neither is part of any chunk. Is there a name for this approach, or something along these lines?
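
In case it helps to see the idea spelled out, here is a rough sketch of that trim-and-scan heuristic in Python, assuming the URLs stay in their original order; the rule that a chunk needs at least one immediately following match is my own assumption.

    # Sketch: trim each URL back to its last slash and count how many of the
    # immediately following URLs share that prefix. Consecutive matches form
    # a chunk; matches that only appear further down the list are ignored.
    def chunk_urls(urls):
        chunks, i = [], 0
        while i < len(urls):
            prefix = urls[i].rsplit("/", 1)[0] + "/"
            j = i + 1
            while j < len(urls) and urls[j].startswith(prefix):
                j += 1
            if j - i >= 2:                 # at least one immediately following match
                chunks.append(urls[i:j])
            else:
                chunks.append([urls[i]])   # singleton: not part of any chunk
            i = j
        return chunks

On the sample list above this groups the right/now, right-1/now and right-2/now URLs into separate chunks and leaves the shallow URLs as singletons, which is roughly the behaviour described.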

Yes, try DBSCAN

We don't have your data, so we don't know if it will work for you.

But DBSCAN (in particular, Generalized DBSCAN) is very flexible and easy to adapt. In your case, you will need to formalize the similarity you have been discussing throughout most of your question. Consider breaking URLs at slashes and treating each component as a token; that is probably the simplest approach.
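
As a rough starting point, here is a sketch of what that could look like with scikit-learn's DBSCAN on a precomputed distance matrix; the token-level distance and the eps/min_samples values below are assumptions you would need to tune on your own data.

    # Sketch: split URLs at slashes, compare them token by token, and feed the
    # resulting pairwise distance matrix to DBSCAN (metric="precomputed").
    import numpy as np
    from sklearn.cluster import DBSCAN

    def token_distance(u: str, v: str) -> float:
        """Number of differing path components plus the difference in depth."""
        a, b = u.split("/"), v.split("/")
        mismatches = sum(x != y for x, y in zip(a, b))
        return mismatches + abs(len(a) - len(b))

    def cluster_urls(urls, eps=1.5, min_samples=2):
        n = len(urls)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                dist[i, j] = dist[j, i] = token_distance(urls[i], urls[j])
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(dist)
        return labels  # label -1 marks noise, i.e. URLs outside any chunk

With min_samples=2, URLs that have no sufficiently close neighbour come out as noise (label -1); whether the resulting groups match the expected output above depends entirely on the token distance and eps, so treat both as knobs.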

Anyway: define your desired similarity, and then try out DBSCAN and OPTICS. And maybe share your experiences somewhere, so the next student can build upon them. Try to produce some shareable code and give it back to the community; put your name on it to get credit.
