How to use OneHotEncoder output in an ordinary least squares regression plot
I have been trying to perform ordinary least squares regression using the scikit-learn library, but I have hit another snag.
I have used OneHotEncoder to binarize my categorical (independent) features, and I have an array like so:
x = [[ 1. 0. 0. ..., 0. 0. 0.]
[ 1. 0. 0. ..., 0. 0. 0.]
[ 0. 1. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]
[ 1. 0. 0. ..., 0. 0. 0.]]
The dependent variable (Y) is stored in a one-dimensional array. Everything works, except that when I try to plot these values I get an error:
# Plot outputs
pl.scatter(x_test, y_test, color='black')
ValueError: x and y must be the same size
When I use numpy.size on X and Y respectively, it is clear that this is a reasonable error:
>>> print np.size(x)
5096
>>> print np.size(y)
98
Interestingly, the two sets of data are accepted by the fit method.
My question is: how can I transform the output of OneHotEncoder to use in my regression?
If I understand you correctly, your input X is an [n x m] matrix and your output Y is [n x 1], where n = number of data points and m = number of features. This is the layout scikit-learn expects: one row per data point, one column per feature.
Firstly, the linear regression fitting function does not care that X is [n x m] while Y is [n x 1], as it simply fits a parameter vector theta of dimension [m x 1], i.e.,
Y = X * theta
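As a rough illustration of the shapes involved, here is a minimal sketch with made-up data. The `categories` array, its five levels and the random `y` are assumptions for the example, not the asker's data. It shows that fit is happy with a two-dimensional X and a one-dimensional Y, while numpy.size reports very different totals for the two, which is what scatter complains about.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical data: 98 samples of one categorical feature with 5 levels.
categories = rng.integers(0, 5, size=(98, 1))
y = rng.normal(size=98)

# One-hot encode; toarray() turns the sparse result into a dense array.
x = OneHotEncoder().fit_transform(categories).toarray()

# fit accepts a 2-D x and a 1-D y as long as the row counts match.
model = LinearRegression().fit(x, y)

print(x.shape)            # (data points, one-hot columns)
print(y.shape)            # (data points,)
print(model.coef_.shape)  # one coefficient per one-hot column
print(np.size(x), np.size(y))  # total element counts differ, hence the scatter error
```

In the question, np.size(x) is 5096 and np.size(y) is 98, which corresponds to 98 rows of 52 one-hot columns each.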
Unfortunately, as noted by eickenberg, you cannot plot all of the X features against Y in a single matplotlib scatter call as you have done. That is why you get the size error: scatter expects n x-values for n y-values, but you passed n x m x-values for n y-values.
To fix your problem, try looking at a single feature at a time:
pl.scatter(x_test[:,0], y_test, color='black')
Assuming you have standardised your data (subtracted the mean and divided by the standard deviation), a quick and dirty way to see the trends would be to plot all of the features on a single axes:
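import matplotlib.pyplot as plt

fig = plt.figure(0)
ax = fig.add_subplot(111)
n, m = x_test.shape  # n data points, m features
for i in range(m):
    # one scatter per feature column, all on the same axes
    ax.scatter(x_test[:, i], y_test)
plt.show()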
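The standardisation assumed above could be sketched as follows. The helper name `standardise` and the demo array are hypothetical, and leaving constant columns unscaled is my own assumption to avoid dividing by zero, which can happen with a one-hot column that never changes.

```python
import numpy as np

def standardise(x):
    """Return x with each column shifted to zero mean and scaled to unit std."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Assumption: leave constant columns unscaled to avoid dividing by zero.
    std[std == 0] = 1.0
    return (x - mean) / std

# Hypothetical demo data: two one-hot style columns and one constant column.
x_demo = np.array([[1.0, 0.0, 5.0],
                   [0.0, 1.0, 5.0],
                   [1.0, 0.0, 5.0],
                   [0.0, 1.0, 5.0]])
z = standardise(x_demo)
print(z.mean(axis=0))  # approximately zero in every column
```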
To visualise all of the features at once in separate panels (depending on the number of features), look at, e.g., matplotlib's subplot2grid routine or another Python module such as pandas.
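One possible way to lay out one panel per feature is sketched below. Note that it uses plt.subplots rather than subplot2grid, simply because it is shorter for a regular grid. The random `x_test` and `y_test`, the three-column layout and the panel titles are assumptions for the example.

```python
import math

import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the x_test / y_test of the question.
x_test = rng.integers(0, 2, size=(30, 6)).astype(float)
y_test = rng.normal(size=30)

n, m = x_test.shape
ncols = 3
nrows = math.ceil(m / ncols)

# squeeze=False keeps `axes` two-dimensional even for a single row.
fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)

for i, ax in enumerate(axes.flat):
    if i < m:
        ax.scatter(x_test[:, i], y_test, color="black")
        ax.set_title(f"feature {i}")
    else:
        ax.set_visible(False)  # hide unused panels

fig.tight_layout()
# Call plt.show() or fig.savefig(...) here to view the result.
```

With one-hot columns every panel only has x-values of 0 and 1, so each panel is effectively comparing the Y values with and without that category.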