
How to find if English words exist in a string

I am trying to parse some web domains (tens of thousands) to see if they contain any English words.

Extracting the main part of each domain with tldextract is easy. I then tried using enchant to check whether that part exists in the English dictionary.

The problem is that I do not know how to split the domains into multiple words to check, i.e. latimes returns False, but times would return True.

Does anyone know a clever way to check whether the strings contain any English word at all?

Thanks!

Unless you need to do that in a hurry, you could just chip off letters from the beginning or the end of the string, and check if it's a known word; if it is, cut it off and repeat. With e.g. 50k domains of 20 letters each, at worst you'll do about 1M lookups. With a lookup taking e.g. 5 ms (hitting an HDD every time), it will take 5000 seconds (roughly 1.4 hours), which is less time than you'd spend coming up with a better algorithm.
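
A minimal sketch of one way to read that chipping idea, not a definitive implementation. The function name find_words, the min_len cutoff, and the tiny SAMPLE_WORDS set are all assumptions made for illustration; in practice the lookup would be a real dictionary check. This variant tries the longest candidate first at each position, so it can do more lookups per string than the estimate above (roughly n*n/2 for a string of n letters in the worst case).

```python
# Hypothetical stand-in for a real English dictionary lookup.
SAMPLE_WORDS = {"times", "time", "new", "york"}


def find_words(label, is_word, min_len=3):
    """Greedily chip known words off the front of `label`.

    `is_word` is any callable that returns True for a known word.
    `min_len` (an assumed cutoff) skips very short candidates, which
    dictionaries tend to accept too readily.
    """
    found = []
    start = 0
    while start < len(label):
        match = None
        # Try the longest candidate starting here, then shorter ones.
        for end in range(len(label), start + min_len - 1, -1):
            candidate = label[start:end]
            if is_word(candidate):
                match = candidate
                break
        if match:
            found.append(match)
            start += len(match)  # cut the word off and repeat
        else:
            start += 1  # no word starts here: chip off one letter
    return found


def in_sample(word):
    return word in SAMPLE_WORDS


print(find_words("latimes", in_sample))
print(find_words("newyorktimes", in_sample))
```

With pyenchant installed, passing enchant.Dict("en_US").check as is_word should work in place of in_sample. A domain contains at least one English word whenever the returned list is non-empty.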
