
Domain Name Matching algorithms

I have a list of companies and want to work out which of the domains returned by a Google search are likely to belong to each company. Are there any existing algorithms for this use case that are also legally usable in a commercial project?

For example, given the company name Internet Movie Database, Google returns a set of results, of which the valid ones could be internetmoviedatabase, internet-movie-database, the-internet-movie-database, theinternetmoviedatabase, internetmovies, internet-movies, imd, and imdb. (Note: I have excluded TLDs from the list to keep the question simple.)

Sounds like you are looking for an approximate string matching algorithm. Not sure if you are looking for just the algorithm or an implementation.

There is already a question on it here: String matching algorithm

One possible solution is to use Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
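To make the Levenshtein suggestion concrete, here is a minimal Python sketch using only the standard library. The helper names (`normalize`, `levenshtein`, `similarity`) and the idea of dividing the distance by the longer string's length are my own assumptions for illustration, not part of any particular library; treat it as a starting point rather than a finished matcher.

```python
import re


def normalize(name: str) -> str:
    """Lowercase and drop everything except letters and digits (assumed preprocessing)."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(company: str, domain: str) -> float:
    """Score in [0, 1]; 1.0 means identical after normalization (hypothetical scoring)."""
    a, b = normalize(company), normalize(domain)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    company = "Internet Movie Database"
    candidates = [
        "internetmoviedatabase", "internet-movie-database",
        "the-internet-movie-database", "theinternetmoviedatabase",
        "internetmovies", "internet-movies", "imd", "imdb",
    ]
    for domain in sorted(candidates, key=lambda d: similarity(company, d), reverse=True):
        print(f"{domain:30s} {similarity(company, domain):.3f}")
```

Note that abbreviations such as imd and imdb score very low under plain edit distance, so they would probably need a separate check (for example, comparing against the initials of the company name).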

If you are looking for an implementation, searching Google for "approximate string matching C++" gives this as the first result: http://www.chokkan.org/software/simstring/

Good luck!
