简体   繁体   中英

labelling of dataset in machine learning

I have a question about some basic concepts of machine learning. The examples, I observed, were giving a brief overview .For training the system, feature vector is given as input. In case of supervised learning, the dataset is labelled. I have confusion about labelling. For example if I have to distinguish between two types of pictures, I will provide a feature vector and on output side for testing, I'll provide 1 for type A and 2 for type B. But if I want to extract a region of interest from a dataset of images. How will I label my data to extract ROI using SVM. I hope I am able to convey my confusion. Thanks in anticipation.

In supervised learning, such as SVMs, the dataset should be composed as follows:

<i-th feature vector><i-th label>

where i goes from 1 to the number of patterns (also examples or observations ) in your training set so this represents a single record in your training set which can be used to train the SVM classifier.

So you basically have a set composed by such tuples and if you do have just 2 labels (binary classification problem) you can easily use a SVM. Indeed the SVM model will be trained thanks to the training set and the training labels and once the training phase has finished you can use another set (called Validation Set or Test Set), which is structured in the same way as the training set, to test the accuracy of your SVMs.
In other words the SVM workflow should be structured as follows:

  1. train the SVM using the training set and the training labels
  2. predict the labels for the validation set using the model trained in the previous step
  3. if you know what the actual validation labels are, you can match the predicted labels with the actual labels and check how many labels have been correctly predicted. The ratio between the number of correctly predicted labels and the total number of labels in the validation set returns a scalar between [0;1] and it's called the accuracy of your SVM model.
  4. if you're interested in the ROI, you might want to check the trained SVM parameters (mainly the weights and bias) to reconstruct the separation hyperplane

It is also important to know that the training set records should be correctly, a priori labelled : if the training labels are not correct, the SVM will never be able to correctly predict the output for previously unseen patterns. You do not have to label your data according to the ROI you want to extract, the data must be correctly labelled a priori: the SVM will have the entire set of type A pictures and the set of type B pictures and will learn the decision boundary to separate pictures of type A and pictures of type B. You do not have to trick the labels: if you do, you're not doing classification and/or machine learning and/or pattern recognition. You're basically tricking the results.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM