
LIBSVM not predicting accurately even using training data

I have the following code that loads a set of images, around 50 in each training set, and then builds a linear model and tries to classify the data. I also have a testing set, but the models can't even classify the training data with any kind of accuracy. Is there some error in the way I'm loading the images? I'd be glad to provide more code or my output if it would be helpful.

import glob
from PIL import Image
from svmutil import svm_train, svm_predict

def create_image_list(file_path):
    image_list = []
    for filename in glob.glob(file_path):
        img = Image.open(filename)
        img_resized = img.resize((32, 32), Image.ANTIALIAS)
        pix = img_resized.load()  # read pixels from the resized image, not the original
        pixlist = []
        for x in range(0, 32):
            for y in range(0, 32):
                # flatten the 32x32 RGB image into a list of 3072 values
                pixlist.append(pix[x, y][0])
                pixlist.append(pix[x, y][1])
                pixlist.append(pix[x, y][2])
        image_list.append(pixlist)
    return image_list

dalmation_training = create_image_list('/images/dalmatian/training/*')
dollabill_training = create_image_list('/images/dollar_bill/training/*')
pizza_training = create_image_list('/images/pizza/training/*')
soccer_ball_training = create_image_list('/images/soccer_ball/training/*')
sunflower_training = create_image_list('/images/sunflower/training/*')

c = '1e2'
testing_set = dalmation_training + dollabill_training + pizza_training + soccer_ball_training + sunflower_training

dalmation_y = [1]*len(dalmation_training) + [-1]*len(dollabill_training) + [-1]*len(pizza_training) + [-1]*len(soccer_ball_training) + [-1]*len(sunflower_training)
dalmation_model_linear = svm_train(dalmation_y, testing_set, '-t 0 -c %s -b 1 -q' % c)

dollabill_y = [-1]*len(dalmation_training) + [1]*len(dollabill_training) + [-1]*len(pizza_training) + [-1]*len(soccer_ball_training) + [-1]*len(sunflower_training)
dollabill_model_linear = svm_train(dollabill_y, testing_set, '-t 0 -c %s -b 1 -q' % c)

pizza_y = [-1]*len(dalmation_training) + [-1]*len(dollabill_training) + [1]*len(pizza_training) + [-1]*len(soccer_ball_training) + [-1]*len(sunflower_training)
pizza_model_linear = svm_train(pizza_y, testing_set, '-t 0 -c %s -b 1 -q' % c)

soccer_ball_y = [-1]*len(dalmation_training) + [-1]*len(dollabill_training) + [-1]*len(pizza_training) + [1]*len(soccer_ball_training) + [-1]*len(sunflower_training)
soccer_ball_model_linear = svm_train(soccer_ball_y, testing_set, '-t 0 -c %s -b 1 -q' % c)

sunflower_y = [-1]*len(dalmation_training) + [-1]*len(dollabill_training) + [-1]*len(pizza_training) + [-1]*len(soccer_ball_training) + [1]*len(sunflower_training)
sunflower_model_linear = svm_train(sunflower_y, testing_set, '-t 0 -c %s -b 1 -q' % c)

print 'dalmation linear'
result1, something, p1 = svm_predict([1]*len(testing_set), testing_set, dalmation_model_linear, "-b 1")
print 'dollabill linear'
result2, something, p2 = svm_predict([1]*len(testing_set), testing_set, dollabill_model_linear, "-b 1")
print 'pizza linear'
result3, something, p3 = svm_predict([1]*len(testing_set), testing_set, pizza_model_linear, "-b 1")
print 'soccer linear'
result4, something, p4 = svm_predict([1]*len(testing_set), testing_set, soccer_ball_model_linear, "-b 1")
print 'sunflower linear'
result5, something, p5 = svm_predict([1]*len(testing_set), testing_set, sunflower_model_linear, "-b 1")

When I run this and take a few accuracy measurements, the overall accuracy is around 20% every time: the last dataset, the sunflowers, is near 100% accuracy and the others are near 5%. I believe I am putting the data in the correct format for libsvm, and I can't find any clues. I have tried many different values of c, from 1e-8 to 1e8, and each one changed the accuracy only slightly, by no more than 5%.
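A sweep like that can be written compactly with libsvm's built-in n-fold cross-validation: with the -v option, svm_train returns a cross-validation accuracy instead of a model. Roughly (a sketch, using 5 folds; not exactly what I ran):

for exponent in range(-8, 9):
    c_value = 10.0 ** exponent
    # with -v, svm_train returns the cross-validation accuracy as a float
    cv_accuracy = svm_train(dalmation_y, testing_set, '-t 0 -c %g -v 5 -q' % c_value)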

Any input would be greatly appreciated and I'd be glad to give more info!

  1. There is a big assumption in your design, namely that "the RGB values of all the pixels in every sample of multiple classes create unique patterns that are linearly separable". In my experience this is simply not true. Most people working on image classification with SVMs use higher-level features than raw RGB or intensity values (edges, corners, etc.), and there are several well-known techniques that work relatively well at extracting useful features, such as HOG for pedestrian detection (see the feature-extraction sketch after this list). This is by far the biggest problem with your code, even though you may think the next three points answer your question about accuracy better.
  2. Your negative training set is about 4 times larger than the positive training set. libsvm by default does not handle such a bias in training well, resulting in a heavily skewed hyperplane; it is quite possible that all of your current SVM models return -1 for every testing sample anyway. Whenever you prepare a training set, adjust the number of negatives to roughly match the number of positives by randomly selecting a subset of the negative samples (see the balancing sketch after this list).
  3. Your test is incorrectly designed. You pass the entire testing_set list to svm_predict, but for the true labels you pass [1]*len(testing_set), which is not correct. For the dalmation model, the true labels should be the dalmation_y list calculated earlier (see the corrected evaluation sketch after this list).
  4. Also remember that what you are doing here is "testing accuracy on the training samples", which is not an acceptable way to measure accuracy. Instead, split your entire sample set into training and testing sets -- or, even better, into three sections: training, validation, and testing -- where the training set is about 3-4 times larger than the testing set. Then train the model on the training set and test it on the testing set (the last sketch below combines this with point 3).
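As an illustration of point 1, below is a minimal sketch of extracting HOG features instead of raw pixels. It assumes scikit-image and numpy are installed; the helper name create_hog_list, the 64x64 size, and the HOG parameters are illustrative choices, not something from the original code.

import glob
import numpy as np
from PIL import Image
from skimage.feature import hog

def create_hog_list(file_path):
    feature_list = []
    for filename in glob.glob(file_path):
        # convert to grayscale and resize, then describe the image with
        # HOG features rather than raw RGB values
        img = Image.open(filename).convert('L')
        img_resized = img.resize((64, 64), Image.ANTIALIAS)
        features = hog(np.asarray(img_resized, dtype=float), orientations=9,
                       pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        feature_list.append(list(features))
    return feature_list

The resulting feature lists can then be fed to svm_train in exactly the same way as the raw pixel lists above.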
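For point 2, here is a sketch of balancing the classes by randomly subsampling the negatives, shown for the dalmation model only (random.sample is from the standard library; libsvm also has a per-class weight option, -wi, if you would rather keep all the data):

import random

# keep only about as many negatives as there are positives
negatives = dollabill_training + pizza_training + soccer_ball_training + sunflower_training
negatives = random.sample(negatives, min(len(negatives), len(dalmation_training)))

x = dalmation_training + negatives
y = [1]*len(dalmation_training) + [-1]*len(negatives)
dalmation_model_linear = svm_train(y, x, '-t 0 -c %s -b 1 -q' % c)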
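Points 3 and 4 combined might look roughly like this for the dalmation model; the split helper and the 25% test fraction are arbitrary choices for illustration:

import random

def split(samples, test_fraction=0.25):
    # shuffle a copy of one class and cut it into training and testing parts
    shuffled = samples[:]
    random.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

dalmation_train, dalmation_test = split(dalmation_training)
other_train, other_test = [], []
for samples in (dollabill_training, pizza_training, soccer_ball_training, sunflower_training):
    tr, te = split(samples)
    other_train += tr
    other_test += te

train_x = dalmation_train + other_train
train_y = [1]*len(dalmation_train) + [-1]*len(other_train)
test_x = dalmation_test + other_test
test_y = [1]*len(dalmation_test) + [-1]*len(other_test)  # the true labels, not all 1s

model = svm_train(train_y, train_x, '-t 0 -c %s -b 1 -q' % c)
labels, accuracy, probabilities = svm_predict(test_y, test_x, model, '-b 1')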
