
Python Multiple Simple Linear Regression

Note this is not a question about multiple regression; it is a question about doing simple (single-variable) regression multiple times in Python/NumPy (2.7).

I have two m x n arrays x and y . The rows correspond to each other, and each pair of rows is the set of (x, y) points for one measurement. That is, plt.plot(x.T, y.T, '.') would plot each of the m datasets/measurements.
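For concreteness, the small dataset that appears commented out in the code below illustrates this layout (row k of x pairs with row k of y):

```python
import numpy as np

# The small dataset commented out in the code below:
# three measurements, four (x, y) points each.
y = np.array(((0, 1, 2, 3), (1, 2, 3, 4), (2, 4, 6, 8)))
x = np.tile(np.arange(4), (3, 1))

print(x.shape, y.shape)  # (3, 4) (3, 4)
# Row 2 is the measurement y = 2x + 2:
print(x[2], y[2])        # [0 1 2 3] [2 4 6 8]
```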

I'm wondering what the best way to perform the m linear regressions is. Currently I loop over the rows and use scipy.stats.linregress() . (Assume I don't want solutions based on doing linear algebra with the matrices, but instead want to work with this function or an equivalent black-box function.) I could try np.vectorize , but the docs indicate it also loops.

With some experimenting, I've also found a way to use list comprehensions with map() and get correct results. I've put both solutions below. In IPython, `%%timeit` returns, using a small dataset (commented out):

(loop) 1000 loops, best of 3: 642 µs per loop
(map) 1000 loops, best of 3: 634 µs per loop

To try magnifying this, I made a much bigger random dataset (dimension trials x trials ):

(loop, trials = 1000)  1 loops, best of 3: 299 ms per loop
(loop, trials = 10000) 1 loops, best of 3: 5.64 s per loop
(map, trials = 1000)   1 loops, best of 3: 256 ms per loop
(map, trials = 10000)  1 loops, best of 3: 2.37 s per loop

That's a decent speedup on a really big set, but I was expecting a bit more. Is there a better way?

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
np.random.seed(42)
#y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
#x = np.tile(np.arange(4), (3,1))
trials = 1000
y = np.random.rand(trials,trials)
x = np.tile(np.arange(trials), (trials,1))
num_rows = y.shape[0]
slope = np.zeros(num_rows)
inter = np.zeros(num_rows)
for k, xrow in enumerate(x):
    yrow = y[k,:]
    slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope + inter)
# Can the loop be removed?
tempx = [x[k,:] for k in range(num_rows)]
tempy = [y[k,:] for k in range(num_rows)]
results = np.array(map(stats.linregress, tempx, tempy))
slope_vec = results[:,0]
inter_vec = results[:,1]
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope_vec + inter_vec)
print "Slopes equal by both methods?: ", np.allclose(slope, slope_vec)
print "Inters equal by both methods?: ", np.allclose(inter, inter_vec)
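As an aside on the benchmark setup above: every row of x there is the same np.arange. For that special case (shared x across all rows, an assumption that does not hold for arbitrary per-row x), np.polyfit accepts a 2-D y and fits every dataset in one call, which avoids the Python-level loop entirely:

```python
import numpy as np

np.random.seed(42)
m, n = 5, 100
x0 = np.arange(n, dtype=float)   # one shared x row
y = np.random.rand(m, n)         # m independent measurements

# With deg=1 and a 2-D y, polyfit returns shape (2, m):
# row 0 holds the slopes, row 1 the intercepts.
coeffs = np.polyfit(x0, y.T, 1)
slopes, inters = coeffs[0], coeffs[1]
print(slopes.shape, inters.shape)  # (5,) (5,)
```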

Single-variable linear regression is simple enough to vectorize it manually:

def multiple_linregress(x, y):
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean

    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]

    return np.column_stack((slope, intercept))

With some made-up data:

m = 1000
n = 1000
x = np.random.rand(m, n)
y = np.random.rand(m, n)

it outperforms your looping options by a fair margin:

%timeit multiple_linregress(x, y)
100 loops, best of 3: 14.1 ms per loop
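A quick sanity check (the function is repeated here so the snippet runs on its own): comparing the vectorized results row by row against independent degree-1 np.polyfit fits.

```python
import numpy as np

def multiple_linregress(x, y):
    # Vectorized per-row simple regression (same routine as above).
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    return np.column_stack((slope, intercept))

np.random.seed(0)
x = np.random.rand(10, 50)
y = np.random.rand(10, 50)
res = multiple_linregress(x, y)

# Each row should match an independent degree-1 polyfit.
for k in range(x.shape[0]):
    s, i = np.polyfit(x[k], y[k], 1)
    assert np.isclose(res[k, 0], s) and np.isclose(res[k, 1], i)
print("all rows match")
```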
