
How to perform multivariable linear regression with scikit-learn?

Forgive my terminology, I'm not an ML pro. I might use the wrong terms below.

I'm trying to perform multivariable linear regression. Let's say I'm trying to work out user gender by analysing page views on a web site.

For each user whose gender I know, I have a feature matrix where each row represents a web site section: the first element is the section number and the second element indicates whether they visited it, e.g.:

male1 = [
    [1, 1],     # visited section 1
    [2, 0],     # didn't visit section 2
    [3, 1],     # visited section 3, etc
    [4, 0]
]

So in scikit, I am building xs and ys. I'm representing a male as 1, and a female as 0.

The above would be represented as:

features = male1
gender = 1

Now, I'm obviously not just training a model for a single user; instead, I have tens of thousands of users whose data I'm using for training.

I would have thought I should create my xs and ys as follows:

xs = [
    [          # user1
       [1, 1],    
       [2, 0],     
       [3, 1],    
       [4, 0]
    ],
    [          # user2
       [1, 0],    
       [2, 1],     
       [3, 1],    
       [4, 0]
    ],
    ...
]

ys = [1, 0, ...]

scikit doesn't like this:

from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit(xs, ys)

It complains:

ValueError: Found array with dim 3. Estimator expected <= 2.

How am I supposed to supply a feature matrix to the linear regression algorithm in scikit-learn?

You need to create xs in a different way. According to the docs:

fit(X, y, sample_weight=None)

Parameters:

  X : numpy array or sparse matrix of shape [n_samples, n_features]
      Training data
  y : numpy array of shape [n_samples, n_targets]
      Target values
  sample_weight : numpy array of shape [n_samples]
      Individual weights for each sample
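To compare the data from the question against the documented shape, here is a minimal sketch that converts it to a NumPy array and inspects its dimensions. It assumes only the two example users shown in the question; the real data set would have many more.

```python
import numpy as np

# The two example users from the question; each row is [section_number, visited]
xs = [
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2
]

arr = np.asarray(xs)
print(arr.ndim)   # number of dimensions of the nested list
print(arr.shape)  # (users, sections, elements per row)
```

The third axis comes from the [section_number, visited] pairs, which is why the estimator sees more dimensions than the [n_samples, n_features] it expects.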

Hence xs should be a 2D array with as many rows as users and as many columns as web site sections. You defined xs as a 3D array though. To reduce the number of dimensions by one, you could get rid of the section numbers through a list comprehension:

xs = [[visit for section, visit in user] for user in xs]

If you do so, the data you provided as an example gets transformed into:

xs = [[1, 0, 1, 0], # user1
      [0, 1, 1, 0], # user2
      ...
      ]

and clf.fit(xs, ys) should work as expected.

A more efficient approach to reducing the dimensions would be to slice a NumPy array:
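Putting the pieces together, an end-to-end run might look like the sketch below. The variable name raw, the third and fourth users, and their labels are invented for illustration; only user1 and user2 come from the question.

```python
from sklearn import linear_model

# Toy data in the question's [section_number, visited] format
raw = [
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1 (from the question)
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2 (from the question)
    [[1, 1], [2, 1], [3, 0], [4, 0]],  # user3 (made up)
    [[1, 0], [2, 0], [3, 0], [4, 1]],  # user4 (made up)
]
ys = [1, 0, 1, 0]  # 1 = male, 0 = female; labels for users 3 and 4 are invented

# Drop the section numbers so each user becomes one row of visit flags
xs = [[visit for section, visit in user] for user in raw]

clf = linear_model.LinearRegression()
clf.fit(xs, ys)

print(clf.coef_.shape)      # one coefficient per web site section
print(clf.predict(xs[:1]))  # prediction for user1: a real number, not a class label
```

Note that the comprehension relies on every user listing the same sections in the same order, so that column i always refers to the same section.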

import numpy as np
xs = np.asarray(xs)[:,:,1]
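As a quick check, the following sketch confirms that the slice produces the same 2D matrix as the list comprehension. It uses the two example users from the question; the names raw, by_comprehension and by_slicing are chosen here for illustration.

```python
import numpy as np

# The two example users from the question; each row is [section_number, visited]
raw = [
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2
]

# Pure-Python version: keep only the "visited" flag of each row
by_comprehension = [[visit for section, visit in user] for user in raw]

# NumPy version: take index 1 along the last axis for all users and sections
by_slicing = np.asarray(raw)[:, :, 1]

print(by_slicing.shape)                               # (users, sections)
print(np.array_equal(by_slicing, by_comprehension))   # both give the same matrix
```

Slicing only works when every user has the same number of rows, since the nested list cannot otherwise be converted into a regular 3D array.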

