Word/Sentence similarity. What is the best approach?

I need to build an algorithm for product master data purposes, and I'm not sure about the best NLP approach for this. The scenario is:
- I have Product golden records;
- I have many other Product catalogs that need to be harmonized.

Example:
- Product golden records: Coke and Coke Zero;
- Product descriptions that need to be harmonized: Coke 300ml, Coke Zero 300ml, Cke zero.

I need an algorithm that harmonizes by similarity, since I have to handle typos and, sometimes, a product name that is only part of a longer sentence. Example: Coke zero JS MKT (JS and MKT are garbage, but the sentence is most similar to Coke Zero).

I've been testing some NLP approaches for sentence similarity, such as bag of words, and reading about others, such as cosine similarity and Levenshtein distance. However, I don't know which is the best option for my case.

Could you please help me understand the best way to achieve what I need?

I have found two good solutions: cosine similarity and Levenshtein distance. In my case, cosine similarity worked better, because it easily found the part of the brand name contained in the text, which gave me 100% accuracy. Levenshtein distance (the matrix-based edit distance) was also good, but I got some errors due to very similar words in the dataset.
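
For anyone who wants to try both ideas, here is a minimal sketch using only the Python standard library. It is not the exact code I used: the character-trigram vectorization, the function names (`char_ngrams`, `cosine`, `levenshtein`, `best_match`) and the tiny golden-record list are assumptions made for illustration. Character n-grams are one common way to make cosine similarity tolerant of typos; word-level bag of words would be another option.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string.

    Trigrams are an assumption for this sketch, not a tuned choice.
    """
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def levenshtein(s, t):
    """Edit distance via the classic dynamic-programming matrix, row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def best_match(description, golden_records):
    """Return the golden record with the highest cosine similarity."""
    query = char_ngrams(description)
    return max(golden_records, key=lambda g: cosine(query, char_ngrams(g)))


golden = ["Coke", "Coke Zero"]  # hypothetical golden records from the question
for raw in ["Coke 300ml", "Coke Zero 300ml", "Cke zero", "Coke zero JS MKT"]:
    print(raw, "->", best_match(raw, golden))
```

With this sketch the garbage tokens in "Coke zero JS MKT" only lower the score a little, because every trigram of "Coke Zero" still appears in the description. Edit distance, by contrast, counts every extra character, which is one plausible reason it can confuse records that differ by only a few letters.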
