Ordinary Least Squares Regression for multiple columns in Pandas Dataframe

I'm trying to find a way to run the same linear regression over many columns, all the way up to Z3. Here is a snippet of the dataframe, called df1:

    Time    A1      A2      A3      B1      B2      B3
1   1.00    6.64    6.82    6.79    6.70    6.95    7.02
2   2.00    6.70    6.86    6.92    NaN     NaN     NaN
3   3.00    NaN     NaN     NaN     7.07    7.27    7.40
4   4.00    7.15    7.26    7.26    7.19    NaN     NaN
5   5.00    NaN     NaN     NaN     NaN     7.40    7.51
6   5.50    7.44    7.63    7.58    7.54    NaN     NaN 
7   6.00    7.62    7.86    7.71    NaN     NaN     NaN

This code returns the slope coefficient of a linear regression for one column only and appends the value to a NumPy array called series. Here is what it looks like for extracting the slope of the first column:

import numpy as np
from sklearn.linear_model import LinearRegression

series = np.array([])  # empty array to collect the slopes

df2 = df1[~np.isnan(df1['A1'])]  # drop the rows where A1 is NaN so sklearn can fit
df3 = df2[['Time', 'A1']]
values = np.asarray(df3)  # plain 2-D array; recent sklearn versions reject np.matrix
X, Y = values[:, [0]], values[:, [1]]  # keep both as (n, 1) columns
slope = LinearRegression().fit(X, Y)
m = slope.coef_[0]  # 1-element array holding the slope

series = np.concatenate((series, m), axis=0)

As it stands now, I am copying this block of code and replacing "A1" with a new column name all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but I have the drawback of all these intermediate NaN values in the time series, so it seems like I'm limited to this method or something like it.

I tried using a for loop such as:

for col in df1.columns:

and replacing 'A1' with col in the code, but this does not seem to work.

Is there any way I can do this more efficiently?

Thank you!

One liner (or three)

time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)


Broken down with a bit of explanation

Using the closed form of OLS

β = (X^T X)^(-1) X^T Y

In this case X is time, where we define time as df[['Time']] (df here is the question's df1). I used the double brackets to preserve the dataframe and its two dimensions. If I'd used single brackets, I'd have gotten a series and its one dimension. Then the dot products aren't as pretty.
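The difference between the two bracket styles can be seen in a small sketch. The three-row frame and the names time_2d, time_1d, xtx_2d and xtx_1d are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Tiny illustrative frame; only the Time column matters here.
df = pd.DataFrame({"Time": [1.0, 2.0, 3.0]})

time_2d = df[["Time"]]  # double brackets: a DataFrame with two dimensions
time_1d = df["Time"]    # single brackets: a Series with one dimension

print(time_2d.shape)
print(time_1d.shape)

# With double brackets, X^T X stays a 1x1 DataFrame that pinv can take directly.
xtx_2d = time_2d.T.dot(time_2d)

# With single brackets, the same product collapses to a plain scalar.
xtx_1d = time_1d.dot(time_1d)

print(type(xtx_2d).__name__, type(xtx_1d).__name__)
```

Keeping everything two-dimensional is what lets the chained .dot calls in the one-liner line up without any reshaping.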

(X^T X)^(-1) X^T is np.linalg.pinv(time.T.dot(time)).dot(time.T)

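Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we could do it all together? You have to deal with the NaNs somehow, and I did it by placing zeroes in the NaN spots. Be aware that this is not the same as fitting only over the times where a column has data: the filled rows add nothing to X^T Y, but X^T X is still summed over every row, so the slope of any column with gaps is pulled toward zero. The formula also has no column of ones, so every line is forced through the origin.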
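One way to see what fillna(0) does to the result is to compare it, column by column, with the same no-intercept fit restricted to the rows that actually have data. The sketch below rebuilds the sample frame from the question; the helper masked_origin_slope and the comparison frame are hypothetical names introduced only for this check.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame copied from the question.
df = pd.DataFrame({
    "Time": [1.0, 2.0, 3.0, 4.0, 5.0, 5.5, 6.0],
    "A1": [6.64, 6.70, nan, 7.15, nan, 7.44, 7.62],
    "A2": [6.82, 6.86, nan, 7.26, nan, 7.63, 7.86],
    "A3": [6.79, 6.92, nan, 7.26, nan, 7.58, 7.71],
    "B1": [6.70, nan, 7.07, 7.19, nan, 7.54, nan],
    "B2": [6.95, nan, 7.27, nan, 7.40, nan, nan],
    "B3": [7.02, nan, 7.40, nan, 7.51, nan, nan],
})

time = df[["Time"]]

# Closed form from the answer: NaNs replaced by zeros, every row kept.
filled = pd.DataFrame(
    np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
    ["Slope"], df.columns,
)


def masked_origin_slope(x, y):
    """Hypothetical helper: no-intercept slope over the rows where y is present."""
    mask = y.notna()
    xm, ym = x[mask].to_numpy(), y[mask].to_numpy()
    return float(xm.dot(ym) / xm.dot(xm))


masked = pd.Series(
    {c: masked_origin_slope(df["Time"], df[c]) for c in df.columns},
    name="Slope",
)

# Side-by-side view: the two columns agree only where a column has no NaNs.
comparison = pd.DataFrame({"fillna(0)": filled.loc["Slope"], "masked": masked})
print(comparison)
```

On this data the two columns match for Time, which has no gaps, and differ for every column that contains NaNs, with the fillna(0) slope being the smaller of the two.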

Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.

Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!

Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:

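import numpy as np
from sklearn.linear_model import LinearRegression

slopes = []

for c in df1.columns:
    if c == "Time":
        continue  # skip the predictor column, but keep looping
    mask = ~np.isnan(df1[c])  # rows where this column has data
    x = np.atleast_2d(df1.Time[mask].values).T  # shape (n, 1), as sklearn expects
    y = df1[c][mask].values  # shape (n,)
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])  # scalar slope for column c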

I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
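For anyone who wants to try the looping idea without scikit-learn, here is a minimal self-contained sketch. It swaps in np.polyfit with deg=1, which, like LinearRegression with its default settings, fits a slope and an intercept. The reduced frame (only A1 and B2 from the question) and the choice to collect results in a pandas Series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Reduced version of the question's frame: Time plus two of its columns.
df1 = pd.DataFrame({
    "Time": [1.0, 2.0, 3.0, 4.0, 5.0, 5.5, 6.0],
    "A1": [6.64, 6.70, nan, 7.15, nan, 7.44, 7.62],
    "B2": [6.95, nan, 7.27, nan, 7.40, nan, nan],
})

slopes = {}
for c in df1.columns.drop("Time"):
    mask = df1[c].notna()  # fit only on rows where this column has data
    x = df1.loc[mask, "Time"].to_numpy()
    y = df1.loc[mask, c].to_numpy()
    # np.polyfit with deg=1 returns the coefficients as [slope, intercept].
    slope, intercept = np.polyfit(x, y, deg=1)
    slopes[c] = slope

# Keep the column labels attached to the results.
slopes = pd.Series(slopes, name="Slope")
print(slopes)
```

Because each column gets its own mask, the gaps in one column never affect the fit of another.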
