简体   繁体   中英

How to cluster different strings using machine learning in python

我有一个由建筑物名称组成的数据集。例如 {Hill View,Hills View,Hill Apartment...}。我想使用机器学习对这些字符串进行聚类。例如,在聚类后,一个聚类应该包含相似或有些相似的字符串{Hills,Hill...}。我尝试了各种 scikit 算法,如 K-means、Affinity Propagation 等,但没有成功。请帮助。

Machine Learning isn't magic! It uses mathematical objects and functions.

You need first steps - usually known as Data Mining - which kind of consists in:

  • Transforming any input (string, pictures, videos, anything...) to numbers (vectors, matrices or any relevent structure).

  • Defining distance and similarity between vectors (= distance between the numerical representation of your input ~= distance between string, pictures, videos, anything).

This is not trivial and can be done different ways depending on your data/objectives.

Since I don't know your background in CS/ML/Maths, I could just give you a general approach which is, in a general case, quite good/easy.

That is the general speach, in pratice this problematic is complex and there's a lot to learn on that. You will most probably need the edit distance which is the most intuitive distance between words, you should also consider stemming which.

Can't give a better anwser without more information on data/context.

Regards

Got it: Please follow this link for document clustering: http://brandonrose.org/clustering It gives an exact precise description.In order to convert it into normal string clustering where you have a list of names(strings) just pass the the list in place of the title list passed in the explanation.Also replace each occurrence of synopses list in the example with the list you want to cluster(in this case the list containing the strings to be clustered)

You can skip few snippets since they provide extra information.Keeping them in the code will not harm you final clusters.

you can use the Naive Bayes algorithm for phrase clustering, for example in php

$classifier = new \Niiknow\Bayes();

// teach it positive phrases

$classifier->learn('amazing, awesome movie!! Yeah!! Oh boy.', 'positive');
$classifier->learn('Sweet, this is incredibly, amazing, perfect, great!!', 'positive');

// teach it a negative phrase

$classifier->learn('terrible, shitty thing. Damn. Sucks!!', 'negative');

// now ask it to categorize a document it has never seen before

$classifier->categorize('awesome, cool, amazing!! Yay.');
// => 'positive'

relevant library at here

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM