
Row-wise prediction over Pandas dataframe by passing sklearn.predict to df.apply

Assume we have a Pandas dataframe and a scikit-learn model trained (fit) on that dataframe. Is there a way to do row-wise prediction? The use case is to fill in empty values in the dataframe using the model's predict function.

I expected this to be possible using the pandas apply function (with axis=1), but I keep getting dimensionality errors.

Using Pandas version '0.22.0' and sklearn version '0.19.1'.

Simple example:

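import pandas as pd
from sklearn.cluster import KMeans

# 45 rows: every (x, y) pair plus their product
data = [[x, y, x * y] for x in range(1, 10) for y in range(10, 15)]

df = pd.DataFrame(data, columns=['input1', 'input2', 'output'])

model = KMeans()
# KMeans is unsupervised, so the y argument is ignored
model.fit(df[['input1', 'input2']], df['output'])

# Fails: each row reaches predict as a 1D Series
df['predictions'] = df[['input1', 'input2']].apply(model.predict, axis=1)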

The resulting dimensionality error:

ValueError: ('Expected 2D array, got 1D array instead:\narray=[ 1. 
10.].\nReshape your data either using array.reshape(-1, 1) if your data has 
a single feature or array.reshape(1, -1) if it contains a single sample.', 
'occurred at index 0')
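The error message points at the cause: with axis=1, apply hands each row to the function as a one-dimensional Series, while predict expects a 2D array of shape (n_samples, n_features). A minimal sketch of the shape mismatch follows; the n_clusters, n_init and random_state values are illustrative assumptions, not taken from the original post, and the model is fit on a plain NumPy array only to keep the example free of feature-name warnings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = [[x, y, x * y] for x in range(1, 10) for y in range(10, 15)]
df = pd.DataFrame(data, columns=["input1", "input2", "output"])
features = df[["input1", "input2"]]

# Illustrative settings (assumptions), fixed so the run is repeatable.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features.to_numpy())

row = features.iloc[0]          # what apply(axis=1) passes to the function
row_1d = row.to_numpy()         # shape (2,): one-dimensional
row_2d = row_1d.reshape(1, -1)  # shape (1, 2): one sample, two features

# The 1D row is rejected by predict's input validation.
try:
    model.predict(row_1d)
    raised = False
except ValueError:
    raised = True

# The 2D row is accepted; predict returns a length-1 array, so take element 0.
label = model.predict(row_2d)[0]
print(row_1d.shape, row_2d.shape, raised)
```

Because predict always returns an array, even for a single sample, the trailing [0] is what turns the result into one scalar per row.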

Running predict on the whole column works fine:

df['predictions'] = model.predict(df[['input1','input2']])

However, I want the flexibility to use this row-wise.

I've tried various approaches to reshape the data first, for example:

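import numpy as np

def reshape_predict(row):
    # row is a 1D Series; reshape it to a single sample of shape (1, n_features)
    return model.predict(np.reshape(row.values, (1, -1)))

df[['input1', 'input2']].apply(reshape_predict, axis=1)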

This just returns the input with no error, whereas I expected it to return a single column of output values (as an array).

SOLUTION:

Thanks to Yakym for providing a working solution! After trying a few variants based on his suggestion, the easiest solution was to simply wrap the row values in square brackets (I had tried this previously, but without the 0 index on the prediction, with no luck).

df['predictions'] = df[['input1','input2']].apply(lambda x: model.predict([x])[0],axis=1)

Slightly more verbosely, you can turn each row into a 2D array by adding a new axis to the values. You then have to access the prediction with index 0:

df["predictions"] = df[["input1", "input2"]].apply(
    lambda s: model.predict(s.values[None])[0], axis=1
)
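For the stated use case of filling empty values, something along these lines might work. This is only a sketch under assumptions: LinearRegression is swapped in for illustration (KMeans returns cluster labels rather than output values), the blanked-out rows are hypothetical, and predict_row is a helper name invented here. It predicts only for rows whose output is missing, once in a single batch call and once row-wise, and the two results should agree.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

data = [[x, y, x * y] for x in range(1, 10) for y in range(10, 15)]
df = pd.DataFrame(data, columns=["input1", "input2", "output"]).astype(float)

# Hypothetical gaps: blank out a few outputs to simulate empty values.
df.loc[[3, 17, 40], "output"] = np.nan

features = ["input1", "input2"]
missing = df["output"].isna()

# Fit only on rows where the target is known.
model = LinearRegression().fit(df.loc[~missing, features], df.loc[~missing, "output"])

# Batch version: one predict call for all rows that need filling.
batch_filled = df["output"].copy()
batch_filled[missing] = model.predict(df.loc[missing, features])

# Row-wise version: turn each row into a one-row frame so predict sees 2D input.
def predict_row(row):
    return model.predict(row.to_frame().T)[0]

rowwise_filled = df["output"].copy()
rowwise_filled[missing] = df.loc[missing, features].apply(predict_row, axis=1)

print(np.allclose(batch_filled, rowwise_filled))
```

The batch call is usually the better default, since it makes one predict call instead of one per row; the row-wise form is mainly useful when each row needs its own handling.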


 