
Labelling of a dataset in machine learning

I have a question about some basic concepts of machine learning. The examples I have seen give only a brief overview. To train the system, a feature vector is given as input, and in supervised learning the dataset is labelled. I am confused about the labelling. For example, if I have to distinguish between two types of pictures, I will provide a feature vector and, on the output side for testing, 1 for type A and 2 for type B. But what if I want to extract a region of interest (ROI) from a dataset of images? How should I label my data to extract the ROI using an SVM? I hope I have conveyed my confusion. Thanks in anticipation.

In supervised learning, for example with SVMs, the dataset should be composed as follows:

<i-th feature vector><i-th label>

where i goes from 1 to the number of patterns (also called examples or observations) in your training set. Each such pair is a single record of the training set and can be used to train the SVM classifier.
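As a minimal sketch of that record layout, the snippet below stores a training set as (feature vector, label) pairs and then splits it into the two parallel lists most libraries expect. The feature values are invented for illustration; the labels 1 (type A) and 2 (type B) follow the convention used in the question.

```python
# Hypothetical toy training set: each record is (i-th feature vector, i-th label).
# The numbers are made up; 1 = type A, 2 = type B as in the question.
training_set = [
    ([0.9, 0.1], 1),
    ([0.8, 0.3], 1),
    ([0.2, 0.7], 2),
    ([0.1, 0.9], 2),
]

# Most libraries take the features and the labels as two parallel sequences.
features = [vector for vector, _ in training_set]
labels = [label for _, label in training_set]

# One label per feature vector, in the same order.
assert len(features) == len(labels)
```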

So you basically have a set composed of such tuples, and if you have just 2 labels (a binary classification problem) you can easily use an SVM. The SVM model is trained on the training set and the training labels. Once the training phase has finished, you can use another set (called the validation set or test set), structured in the same way as the training set, to test the accuracy of your SVM.
In other words, the SVM workflow should be structured as follows:

  1. train the SVM using the training set and the training labels
  2. predict the labels for the validation set using the model trained in the previous step
  3. if you know the actual validation labels, compare them with the predicted labels and count how many were predicted correctly. The ratio between the number of correctly predicted labels and the total number of labels in the validation set is a scalar in [0, 1], called the accuracy of your SVM model.
  4. if you're interested in the ROI, you might want to check the trained SVM parameters (mainly the weights and bias) to reconstruct the separating hyperplane
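The four steps above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the original answer: it uses scikit-learn's `SVC` with a linear kernel, synthetic two-dimensional data, and an arbitrary 70/30 split. The weights and bias are read from `coef_` and `intercept_`, which scikit-learn exposes only for linear kernels.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, well-separated data (an assumption made for illustration only).
rng = np.random.default_rng(0)
X_a = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))  # type A
X_b = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))    # type B
X = np.vstack([X_a, X_b])
y = np.array([1] * 50 + [2] * 50)  # labels known a priori

# Shuffle, then keep 70 records for training and 30 for validation.
order = rng.permutation(len(y))
train_idx, val_idx = order[:70], order[70:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# Step 1: train the SVM on the training set and training labels.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# Step 2: predict labels for the validation set.
y_pred = model.predict(X_val)

# Step 3: accuracy = correctly predicted labels / total validation labels.
accuracy = float(np.mean(y_pred == y_val))

# Step 4: weights and bias of the separating hyperplane w . x + b = 0.
w = model.coef_[0]
b = model.intercept_[0]

print(f"accuracy={accuracy:.2f}, w={w}, b={b:.3f}")
```

With a non-linear kernel there is no single weight vector to inspect, so step 4 as written applies to the linear case only.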

It is also important to know that the training set records must be correctly labelled a priori: if the training labels are not correct, the SVM cannot be expected to correctly predict the output for previously unseen patterns. You should not label your data according to the ROI you want to extract; the data must be correctly labelled a priori. The SVM will be given the entire set of type A pictures and the set of type B pictures, and will learn the decision boundary that separates pictures of type A from pictures of type B. You must not tamper with the labels: if you do, you're not doing classification, machine learning, or pattern recognition. You're basically rigging the results.


 