
SciKit One-class SVM classifier training time increases exponentially with size of training data

I am using the Python SciKit OneClass SVM classifier to detect outliers in lines of text. The text is converted to numerical features first using bag of words and TF-IDF.

When I train (fit) the classifier running on my computer, the time seems to increase exponentially with the number of items in the training set:

Number of items in the training data and the training time taken: 10K: 1 sec, 15K: 2 sec, 20K: 8 sec, 25K: 12 sec, 30K: 16 sec, 45K: 44 sec.

Is there anything I can do to reduce the training time, and to keep it from becoming too long once the training data grows to a couple of hundred thousand items?

Well, scikit's SVM is a high-level implementation, so there is only so much you can do. In terms of speed, their website says: "SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation."

You can increase the kernel cache size (the `cache_size` parameter, in MB) if you have RAM available, but this increase does not help much.

You can try changing your kernel, though a different kernel might not be the right model for your data.
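A minimal sketch of raising the kernel cache, assuming the parameter meant here is `cache_size` (in MB, default 200) on `OneClassSVM`. The toy lines, `nu=0.1` and the value 1000 are placeholders, not values from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Toy stand-in for the real lines of text (assumed data).
lines = ["user login ok", "user logout ok", "user login ok again",
         "disk check ok", "user login failed", "kernel panic at boot"] * 20

# Bag of words + TF-IDF, as described in the question; X is a sparse matrix.
X = TfidfVectorizer().fit_transform(lines)

# cache_size is the kernel cache in MB; 1000 is an arbitrary example value.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1, cache_size=1000)
clf.fit(X)

pred = clf.predict(X)  # +1 for inliers, -1 for outliers
print(pred[:6])
```

A larger cache only avoids recomputing kernel values, so it does not change how the training cost grows with the number of items.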

Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use : Scale your data.
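For sparse TF-IDF features, one way to follow that tip is `MaxAbsScaler`, which scales each feature by its maximum absolute value and keeps the matrix sparse. This is only a sketch: the toy lines and `nu=0.1` are made up, and since `TfidfVectorizer` already L2-normalises each row by default, the extra scaling may change little in practice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import OneClassSVM

# Toy stand-in for the real lines of text (assumed data).
lines = ["user login ok", "user logout ok", "user login ok again",
         "disk check ok", "user login failed", "kernel panic at boot"] * 20

# Vectorise, scale every feature into [0, 1], then fit the one-class SVM.
model = make_pipeline(TfidfVectorizer(), MaxAbsScaler(), OneClassSVM(nu=0.1))
model.fit(lines)

pred = model.predict(lines)  # +1 for inliers, -1 for outliers
print(pred[:6])
```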

Otherwise, don't use scikit and implement it yourself using neural nets.

Hope I'm not too late. OCSVM, like SVM in general, is resource hungry, and training time grows at least quadratically with the number of samples (the numbers you show are consistent with this). If you can, see whether Isolation Forest or Local Outlier Factor works for you. If you're considering a much larger dataset, I would suggest building your own anomaly detection (AD) model that closely resembles these off-the-shelf solutions. That way you should be able to run it either in parallel or with threads.
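A quick way to check the growth rate is to fit a straight line to log(time) against log(items) using the timings from the question. If time grows like n^k, the slope is k. With only six coarse measurements this is a rough estimate, not a benchmark.

```python
import math

# Timings quoted in the question: (training items, seconds).
timings = [(10_000, 1), (15_000, 2), (20_000, 8),
           (25_000, 12), (30_000, 16), (45_000, 44)]

xs = [math.log(n) for n, _ in timings]
ys = [math.log(t) for _, t in timings]

# Ordinary least-squares slope of log(time) against log(items).
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))

print(f"estimated exponent: {slope:.2f}")
```

The estimate lands between 2 and 3, which points to polynomial growth rather than exponential growth.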

For anyone coming here from Google: sklearn has implemented `SGDOneClassSVM`, which "has a linear complexity in the number of training samples". It should be faster for large datasets.
