
How to classify unlabelled data?

Original post: 2019-04-08 15:02:18 | Tags: python, machine-learning, classification

I am new to machine learning. I am trying to build a classifier that labels a piece of text as either containing a URL or not containing one. The data is not labelled; I only have the raw text. I don't know how to proceed. Any help or examples would be appreciated.

2 answers

You cannot train a classifier with unlabeled data; you need labeled examples. There are services that will label the data for you, but it might be simpler to do it by hand (I assume you can get through about one example per minute).
Stack Overflow is for programming; this question would be better suited to, say, Cross Validated. Maybe they'll have better suggestions than me.
After you've labeled the data, there's a lot of information on the web on this subject. For example, this blog is a good place to start if you already have some grasp of the issue.

Good luck!

Since it's text, you can use the bag-of-words technique to turn each text into a vector of word counts.
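To make the bag-of-words step concrete, here is a minimal sketch in plain Python. The sample texts, the tokenizer, and the function names `tokenize` and `bag_of_words` are assumptions for illustration, not something from the question.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(texts):
    """Return (vocabulary, vectors): one word-count vector per text."""
    tokenized = [tokenize(t) for t in texts]
    # A sorted vocabulary gives every word a fixed column index.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample texts, used only for illustration.
texts = [
    "see the docs at http://example.com",
    "see you at lunch",
]
vocabulary, vectors = bag_of_words(texts)
print(vocabulary)
print(vectors)
```

In practice a library class such as scikit-learn's `CountVectorizer` does the same job and also handles sparse storage for large vocabularies.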

You can use cosine similarity to cluster texts of a similar type.
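As a rough illustration of clustering by cosine similarity, the sketch below uses a simple greedy "leader" scheme: each vector joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. The answer does not name a specific clustering algorithm, so the scheme, the `threshold` value, and the toy vectors are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def leader_cluster(vectors, threshold=0.5):
    """Greedy clustering: join the first cluster whose leader is similar enough.

    Returns one integer cluster id per input vector.
    """
    leaders = []  # the first vector assigned to each cluster
    labels = []
    for vec in vectors:
        for cluster_id, leader in enumerate(leaders):
            if cosine_similarity(vec, leader) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels


# Hypothetical word-count vectors over the vocabulary
# ["com", "http", "lunch", "see"], used only for illustration.
vectors = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 0],
]
labels = leader_cluster(vectors, threshold=0.5)
print(labels)
```

Note that nothing guarantees the clusters line up with "has a URL" versus "has no URL"; it is worth reading a few texts from each cluster before trusting the cluster ids as labels.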
Then use the cluster each text falls into as its label.
This way you have a labeled training set.
Which classifier to train on it depends on the number of clusters:
- If you have two clusters, a binary classifier such as logistic regression would work.
- If you have more than two clusters, you need to train a model based on multinomial logistic regression,
- or train multiple logistic models using the One-vs-Rest technique.
Lastly, you can test your model using k-fold cross-validation.
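The last two steps (train a logistic regression, then check it with k-fold cross-validation) could look roughly like this with scikit-learn. The texts and labels are made up for illustration, and the choice of `cv=3` is an assumption driven by the tiny sample. In the approach described above, the labels would come from the cluster ids rather than being typed in by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical texts and labels (1 = contains a URL, 0 = does not).
texts = [
    "read more at http://example.com/a",
    "download from https://example.org/file",
    "my site is http://example.net",
    "docs live at https://example.com/docs",
    "visit http://example.org today",
    "link: https://example.net/page",
    "see you at lunch tomorrow",
    "the meeting moved to friday",
    "thanks for the quick reply",
    "i will call you later",
    "happy birthday to you",
    "the weather is nice today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Bag-of-words features followed by a binary logistic regression.
model = make_pipeline(CountVectorizer(), LogisticRegression())

# 3-fold cross-validation: each fold is held out once for scoring
# while the model is trained on the other two folds.
scores = cross_val_score(model, texts, labels, cv=3)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

With more than two clusters the same pipeline still applies, since `LogisticRegression` accepts multi-class labels; wrapping it in `sklearn.multiclass.OneVsRestClassifier` is one way to get the One-vs-Rest variant instead.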