Profiling SVM (e1071) in R

I am new to R and SVMs, and I am trying to profile the svm function from the e1071 package. However, I can't find any large dataset that allows me to get a good profiling range of results by varying the size of the input data. Does anyone know how to make svm work harder? Which dataset should I use? Are there any particular parameters to svm that make it work harder?

Here are some commands I am using to test performance; they should make it easier to see what I am trying to do:

#loading libraries
library(class)
library(e1071)
#I've been using golubEsets (more examples available)
library(golubEsets)

#get the data: matrix 7129x38
data(Golub_Train)
n <- exprs(Golub_Train)

#duplicate rows (to make the dataset larger)
n <- rbind(n, n)

#take training samples as a vector
samplelabels <- as.vector(Golub_Train@phenoData@data$ALL.AML)

#calculate svm and profile it
Rprof('svm.out')
svmmodel1 <- svm(x=t(n), y=samplelabels, type='C', kernel="radial", cross=10)
Rprof(NULL)
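
To read the profile back afterwards, base R's summaryRprof summarizes where the time went:

summaryRprof("svm.out")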

I keep increasing the dataset by duplicating rows and columns, but I hit the memory limit instead of making svm work harder...

In terms of "working the SVM out": what will make an SVM work "harder" is a more complex model which is not easily separable, higher dimensionality, and a larger, denser dataset.

SVM performance degrades as:

  • Dataset size increases (number of data points) - see the timing sketch after this list
  • Sparsity decreases (fewer zeros)
  • Dimensionality increases (number of attributes)
  • Non-linear kernels are used (and kernel parameters can make the kernel evaluation more complex)
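
To see the first of these effects directly, here is a small sketch that times svm() as the number of data points grows. It uses synthetic data (the dimensionality, sizes, and labeling rule are arbitrary choices, not from the original question):

#time svm() on synthetic datasets of increasing size
library(e1071)
sizes <- c(500, 1000, 2000, 4000)
timings <- sapply(sizes, function(s) {
  X <- matrix(rnorm(s * 100), nrow = s)                 #s points, 100 dims
  y <- factor(ifelse(rowSums(X[, 1:5]) > 0, "A", "B"))  #simple linear rule
  system.time(svm(x = X, y = y, type = "C", kernel = "radial"))["elapsed"]
})
data.frame(points = sizes, seconds = timings)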

Varying Parameters

There are parameters you can change to make the SVM take longer. Of course, these parameters affect the quality of the solution you will get and may not make any sense to use.

Using C-SVM, varying C will result in different runtimes (the analogous parameter in nu-SVM is nu). If the dataset is reasonably separable, making C smaller will result in a longer runtime because the SVM will allow more training points to become support vectors. If the dataset is not very separable, making C bigger will cause longer runtimes because you are essentially telling the SVM you want a narrow-margin solution that fits tightly to the data, and that will take much longer to compute when the data does not separate easily.
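
As a sketch of this, reusing the t(n) and samplelabels objects from the question (the cost values below are arbitrary; in e1071 the C parameter is the cost argument):

#time the same fit across several values of C
costs <- c(0.01, 1, 100)
elapsed <- sapply(costs, function(C) {
  system.time(svm(x = t(n), y = factor(samplelabels), type = "C",
                  kernel = "radial", cost = C))["elapsed"]
})
data.frame(cost = costs, seconds = elapsed)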

Often you find when doing a parameter search that there are parameters that will increase computation time with no appreciable increase in accuracy.
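
e1071 ships a tune() helper for exactly this kind of grid search; a minimal sketch over cost and gamma, again reusing the question's objects (the grid values are arbitrary):

#grid search over cost and gamma; each combination is cross-validated,
#so total runtime grows with the size of the grid
tuned <- tune(svm, train.x = t(n), train.y = factor(samplelabels),
              ranges = list(cost = c(0.1, 1, 10), gamma = c(1e-4, 1e-3)))
summary(tuned)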

The other parameters are kernel parameters, and if you vary them to increase the complexity of calculating the kernel, then naturally the SVM runtime will increase. The linear kernel is simple and will be the fastest; non-linear kernels will of course take longer. Some parameters may not increase the computational complexity of the kernel itself, but will force a much more complex model, which may take the SVM much longer to find the optimal solution for.
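
A quick way to see the difference is to time the kernels side by side on the same data; a hedged sketch (the gamma and degree values are arbitrary):

#compare kernel cost on the same data: linear vs. radial vs. polynomial
system.time(svm(x = t(n), y = factor(samplelabels), type = "C", kernel = "linear"))
system.time(svm(x = t(n), y = factor(samplelabels), type = "C", kernel = "radial", gamma = 0.01))
system.time(svm(x = t(n), y = factor(samplelabels), type = "C", kernel = "polynomial", degree = 5))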

Datasets to Use:

The UCI Machine Learning Repository is a great source of datasets.

The MNIST handwriting recognition dataset is a good one to use - you can randomly select subsets of the data to create increasingly large datasets. Keep in mind that the data at the link contains all ten digits; SVM is of course binary, so you would have to reduce the data to just two digits or do some kind of multi-class SVM.
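
A hypothetical sketch of that reduction, assuming MNIST has already been loaded into a matrix mnist_x (one image per row) with a label vector mnist_y (both names are placeholders, not from the original answer):

#reduce MNIST to a binary problem and subsample to the desired size
keep <- mnist_y %in% c(3, 8)        #any two digits will do
x2 <- mnist_x[keep, ]
y2 <- factor(mnist_y[keep])
idx <- sample(nrow(x2), 5000)       #vary 5000 to scale the dataset
model <- svm(x = x2[idx, ], y = y2[idx], type = "C", kernel = "radial")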

You can easily generate datasets as well. To generate a linear dataset, randomly select a normal vector for a hyperplane, then generate data points and label each one by which side of the hyperplane it falls on. Add some randomness by allowing points within a certain distance of the hyperplane to sometimes be labeled with the other class, and increase the difficulty by increasing that overlap between the classes. Alternatively, generate a number of clusters of normally distributed points, labeled either 1 or -1, so that the distributions overlap at the edges.

The classic non-linear example is a checkerboard: generate points and label them in a checkerboard pattern. To make it more difficult, enlarge the number of squares, increase the dimensionality, and increase the number of data points. You will of course have to use a non-linear kernel for that.
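
Below is a minimal sketch of both generators (the dimensions, counts, and noise levels are arbitrary choices):

set.seed(42)
m <- 5000; d <- 50

#linear generator: pick a random normal vector w, label each point by the
#side of the hyperplane w.x = 0 it falls on, then flip some labels near
#the boundary so the classes overlap
w <- rnorm(d)
X <- matrix(rnorm(m * d), nrow = m)
margin <- as.vector(X %*% w)
y <- ifelse(margin >= 0, 1, -1)
flip <- abs(margin) < 0.5 & runif(m) < 0.3   #noisy points near the plane
y[flip] <- -y[flip]

#checkerboard generator (2-D): label points by the parity of the grid
#cell they fall in; more squares per axis makes the problem harder
k <- 4
P <- matrix(runif(m * 2), nrow = m)
cell <- floor(P * k)
yc <- ifelse((cell[, 1] + cell[, 2]) %% 2 == 0, 1, -1)

#profile an SVM fit on the generated linear data, as in the question
Rprof("svm-generated.out")
model <- svm(x = X, y = factor(y), type = "C", kernel = "radial")
Rprof(NULL)
summaryRprof("svm-generated.out")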


 