How to cluster different strings using machine learning in Python
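I have a dataset made up of building names, for example {Hill View, Hills View, Hill Apartment...}. I want to cluster these strings using machine learning. After clustering, one cluster should contain strings that are similar or somewhat similar, e.g. {Hills, Hill...}. I tried various scikit algorithms such as K-means and Affinity Propagation, but without success. Please help.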
Machine Learning isn't magic! It uses mathematical objects and functions.
You first need some preparatory steps, usually known as data mining, which roughly consist of:
Transforming any input (strings, pictures, videos, anything...) into numbers (vectors, matrices or any relevant structure).
Defining a distance and a similarity between vectors (= distance between the numerical representations of your inputs ~= distance between strings, pictures, videos, anything).
This is not trivial and can be done in different ways depending on your data and objectives.
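As a rough illustration of those two steps, here is a minimal sketch that turns each name into a vector of character bigram counts and compares vectors with cosine similarity. Both the bigram representation and the helper names (`char_ngrams`, `cosine_similarity`) are assumptions made for this example, not something prescribed above; other representations may suit your data better.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Step 1: turn a string into a sparse vector of character n-gram counts."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_similarity(a, b):
    """Step 2: similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


names = ["Hill View", "Hills View", "Ocean Tower"]
vectors = [char_ngrams(name) for name in names]

# Near-duplicate names share most bigrams; unrelated names share few or none.
sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
print(sim_close, sim_far)
```

Once every string has a vector and a similarity, any standard clustering algorithm can work on those numbers instead of on the raw text.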
Since I don't know your background in CS/ML/maths, I can only give you a general approach which is, in the general case, quite good and easy.
That is the general picture; in practice this problem is complex and there is a lot to learn about it. You will most probably need the edit distance, which is the most intuitive distance between words; you should also consider stemming.
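To make the edit distance idea concrete, the following sketch computes the Levenshtein distance with dynamic programming and then groups names greedily by a distance threshold. The grouping rule, the threshold of 2 and the function names are assumptions chosen for illustration; they are only one simple way to go from a distance to clusters.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_by_threshold(names, max_distance=2):
    """Greedy grouping: join the first cluster whose first member is close enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if edit_distance(name.lower(), cluster[0].lower()) <= max_distance:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


names = ["Hill View", "Hills View", "Hill Veiw", "Ocean Tower", "Ocean Towers"]
clusters = cluster_by_threshold(names)
print(clusters)
```

Whole-string edit distance catches spelling variants, but it would not put "Hill Apartment" next to "Hill View"; comparing word by word, or stemming words first, is where the extra work mentioned above comes in.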
I can't give a better answer without more information on the data and context.
Regards
Got it: please follow this link for document clustering: http://brandonrose.org/clustering It gives a precise description. To adapt it to plain string clustering, where you have a list of names (strings), just pass that list in place of the title list used in the explanation. Also replace each occurrence of the synopses list in the example with the list you want to cluster (in this case, the list containing the strings to be clustered).
You can skip a few snippets, since they only provide extra information. Keeping them in the code will not harm your final clusters.
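A condensed sketch of that kind of pipeline, applied directly to a list of names, might look like the following. It uses scikit-learn's `TfidfVectorizer` and `KMeans`; the character n-gram analyzer, the sample names and the choice of two clusters are assumptions made here for short strings, not settings taken from the linked tutorial.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# The list you want to cluster (hypothetical sample data).
names = ["Hill View", "Hills View", "Hill Apartment", "Ocean Tower", "Ocean Towers"]

# Character n-grams inside word boundaries tolerate small spelling differences.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tfidf_matrix = vectorizer.fit_transform(names)

# The number of clusters has to be chosen up front for K-means.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidf_matrix)

for name, label in zip(names, labels):
    print(label, name)
```

The cluster numbers themselves are arbitrary; what matters is which names end up sharing a label.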
You can use the Naive Bayes algorithm for phrase clustering, for example in PHP. Note that it is a supervised classifier, so it has to be taught labelled example phrases first:
$classifier = new \Niiknow\Bayes();
// teach it positive phrases
$classifier->learn('amazing, awesome movie!! Yeah!! Oh boy.', 'positive');
$classifier->learn('Sweet, this is incredibly, amazing, perfect, great!!', 'positive');
// teach it a negative phrase
$classifier->learn('terrible, shitty thing. Damn. Sucks!!', 'negative');
// now ask it to categorize a document it has never seen before
$classifier->categorize('awesome, cool, amazing!! Yay.');
// => 'positive'