
Clustering inside clustering: nested, multi-level clustering of a data table

How can I cluster strings with similar names (such as "McDonald" and "Mc DOnald's") in a dataset and then, when the names are the same (for example, two entries both named "sam"), cluster again within each name group based on value or price? For example, consider the following data table:

name      price
ram         200
shyam       150
ram12        59
gita         45
ram 2        45
g11ita       23
john2        32
john          7
jonh21        8
jonh         38
ram22         3

Then the grouping should be:

ram         200

ram12        59
ram  2       45

ram22         3

john2        32
jonh         37

john          7
john21        8

gita         45
g11ita       23

I have tried string clustering with fuzzywuzzy and Levenshtein distance, but that only clusters the strings; it does not cluster the prices. How can I cluster on the string first and then, where the strings match, cluster on price?

You will need to carefully balance the thresholds for textual similarity and for numerical similarity. There won't be an easy solution, and unless your data is really huge, a manual approach may be best.
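As a rough illustration of what balancing the two thresholds could look like, here is a minimal two-stage sketch. It is not a recommended solution: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, and the normalization in `name_key`, the greedy grouping, and both threshold values (`name_threshold`, `max_ratio`) are assumptions chosen only for this example.

```python
import re
from difflib import SequenceMatcher

def name_key(name):
    """Lowercase and drop digits/whitespace (an assumed normalization)."""
    return re.sub(r"[\d\s]+", "", name.lower())

def group_by_name(rows, name_threshold=0.7):
    """Stage 1: greedily put each row into the first group whose
    representative key is similar enough, else start a new group."""
    groups = []  # list of (representative_key, members)
    for name, price in rows:
        key = name_key(name)
        for rep, members in groups:
            if SequenceMatcher(None, rep, key).ratio() >= name_threshold:
                members.append((name, price))
                break
        else:
            groups.append((key, [(name, price)]))
    return [members for _, members in groups]

def split_by_price(members, max_ratio=2.5):
    """Stage 2: sort one name group by price and start a new cluster
    whenever the next price is more than max_ratio times the previous one.
    Assumes positive prices."""
    ordered = sorted(members, key=lambda row: row[1])
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[1] > prev[1] * max_ratio:
            clusters.append([cur])
        else:
            clusters[-1].append(cur)
    return clusters

def cluster(rows, name_threshold=0.7, max_ratio=2.5):
    """Name groups first, then price clusters inside each name group."""
    result = []
    for group in group_by_name(rows, name_threshold):
        result.extend(split_by_price(group, max_ratio))
    return result

# The sample table from the question.
rows = [("ram", 200), ("shyam", 150), ("ram12", 59), ("gita", 45),
        ("ram 2", 45), ("g11ita", 23), ("john2", 32), ("john", 7),
        ("jonh21", 8), ("jonh", 38), ("ram22", 3)]

clusters = cluster(rows)
for members in clusters:
    print(members)
```

With these hand-picked values the sketch splits the sample table much as the question asks (with shyam left on its own), but that is a consequence of tuning two thresholds to eleven rows. On other data both values would need to be revisited, which is exactly the balancing problem described above.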

Textual similarity of short strings is highly unreliable.

For example, "dog" and "fog" differ by only a single letter, but one is very unlikely to be a typo of the other. They have Levenshtein distance 1, the smallest non-zero value! Because of this, if you rely on Levenshtein, you will get plenty of false positives. That is okay if you verify them manually, but not for automatic processing.
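To make the distance concrete, here is a small, self-contained implementation of the plain Levenshtein distance (insertions, deletions, and substitutions, each costing 1). It is a textbook dynamic-programming sketch written for illustration, not the code fuzzywuzzy runs internally.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]

print(levenshtein("dog", "fog"))    # two unrelated words
print(levenshtein("john", "jonh"))  # a plausible transposition typo
```

Note that the unrelated pair "dog"/"fog" gets distance 1, while the likely typo "john"/"jonh" gets distance 2, because plain Levenshtein counts a swap of two adjacent letters as two edits. A fixed distance cutoff therefore cannot separate these two cases.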

So at a minimum you'll need to use something that knows about (a) existing words, which are unlikely to be misspellings, (b) common misspellings, (c) phonetic similarity, to estimate how likely it is that a word is misspelled, and (d) keyboard similarity, to estimate how likely it is that a word was mistyped...
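As a toy illustration of points (a) and (d) only, the sketch below rejects a pair when both strings are known words and otherwise accepts a single-letter substitution only if the two letters are neighbours on the same QWERTY row. Everything here is hypothetical: the tiny `KNOWN_WORDS` set stands in for a real dictionary, the adjacency check ignores vertical and diagonal neighbours, and the function names are made up for this example. Common misspellings (b) and phonetic similarity (c) are not covered.

```python
# Assumed, deliberately tiny word list; a real one would come from a dictionary.
KNOWN_WORDS = {"dog", "fog", "ram", "john"}

# Letter rows of a QWERTY keyboard; only same-row neighbours are considered.
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")

def are_adjacent_keys(x, y):
    """True if x and y sit next to each other on the same QWERTY row."""
    for row in QWERTY_ROWS:
        if x in row and y in row:
            return abs(row.index(x) - row.index(y)) == 1
    return False

def single_substitution(a, b):
    """Return the differing (char_a, char_b) pair if a and b have equal
    length and differ in exactly one position, else None."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None

def plausible_typo(a, b):
    """Heuristic: could b be a mistyped version of a (or vice versa)?"""
    a, b = a.lower(), b.lower()
    # (a) Two existing words are unlikely to be misspellings of each other.
    if a in KNOWN_WORDS and b in KNOWN_WORDS:
        return False
    # (d) A single substitution is plausible only between neighbouring keys.
    diff = single_substitution(a, b)
    return diff is not None and are_adjacent_keys(*diff)

print(plausible_typo("dog", "fog"))  # both are known words
print(plausible_typo("ram", "rsm"))  # 'a' and 's' are neighbouring keys
print(plausible_typo("ram", "rpm"))  # 'a' and 'p' are far apart
```

The "dog"/"fog" pair shows why the signals have to be combined: "d" and "f" are neighbouring keys, so keyboard similarity alone would flag the pair, and only the known-word check rules it out.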

