fillna with linear regression model built from two columns in dataframe pandas

Question

I have a dataframe that looks like this:

    sepal length    sepal width   petal length  petal width     target
0      4.9              3.5            1.4          0.2         setosa
1      4.9              3.0            1.4          0.2         setosa
2      4.7              3.2            1.3          0.2         setosa
3      4.6              3.1            1.5          0.2         setosa
4      5.0              3.6            1.4          NaN         setosa
      ...

I've created a LinearRegression() model using petal width and petal length. Now I want to use the model I've trained to fill the NaN values.

Here is what I've tried. It works, but I am curious to know if there is a more efficient way.

def fillna_linear_reg(length, width):
    if pd.isna(length):
        pred_length = lin_reg.predict([[width]]) 
        return pred_length[0][0]
    else:
        return length

iris_df["petal length (cm)"] = iris_df.apply(lambda x: fillna_linear_reg(x["petal length (cm)"], x["petal width (cm)"]), axis=1)

Thanks in advance!

Answer 1

Yes, there is a more efficient way. You can call predict once on all rows with missing values and assign the results in a single step. Avoid df.apply with axis=1 whenever possible: it calls the function once per row in Python, which hurts performance, especially when the function being called (here predict, which I assume comes from an sklearn model) is already vectorized.

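def fillna_linear_reg(lin_reg, length, width):
    # Work on a copy so the result does not depend on view/copy semantics.
    length = length.copy()
    nan_mask = length.isna()
    if nan_mask.any():
        # Predict all missing lengths in one call instead of row by row.
        pred_length = lin_reg.predict(width.loc[nan_mask])
        length.loc[nan_mask] = pred_length
    return length

iris_df["petal length (cm)"] = fillna_linear_reg(
    lin_reg, iris_df["petal length (cm)"], iris_df["petal width (cm)"]
)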

Depending on the machine learning module you used for training, you may need to pass the x-data to the predict method as a 2d-array with one row per sample and flatten the result back to a 1d-array. If so, you can replace the line containing the prediction with:

pred_length = np.ravel(lin_reg.predict(width.loc[nan_mask].to_numpy().reshape(-1, 1)))

Here reshape(-1, 1) makes the shape explicit: n values become n rows with one feature each, and np.ravel flattens the predictions back to n values.
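
For reference, here is a minimal end-to-end sketch of the vectorized fill. It assumes scikit-learn's LinearRegression and a small made-up DataFrame in place of the real iris data. The helper names fit_length_on_width and fill_with_model are hypothetical and not from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data, made up for illustration: known lengths follow 2 * width + 1.
df = pd.DataFrame(
    {
        "petal width (cm)": [0.2, 0.4, 1.0, 1.5, 2.0, 0.6],
        "petal length (cm)": [1.4, 1.8, 3.0, np.nan, 5.0, np.nan],
    }
)


def fit_length_on_width(df, length_col, width_col):
    """Fit length ~ width on the rows where both values are present."""
    known = df[[length_col, width_col]].dropna()
    X = known[[width_col]].to_numpy()  # shape (n_samples, 1)
    y = known[length_col].to_numpy()   # shape (n_samples,)
    return LinearRegression().fit(X, y)


def fill_with_model(df, model, length_col, width_col):
    """Return a copy of df with missing lengths replaced by predictions."""
    out = df.copy()
    # Only fill rows where the length is missing and the width is available.
    nan_mask = out[length_col].isna() & out[width_col].notna()
    if nan_mask.any():
        X_missing = out.loc[nan_mask, [width_col]].to_numpy()  # shape (k, 1)
        # One predict call for all missing rows, flattened to shape (k,).
        out.loc[nan_mask, length_col] = np.ravel(model.predict(X_missing))
    return out


model = fit_length_on_width(df, "petal length (cm)", "petal width (cm)")
filled = fill_with_model(df, model, "petal length (cm)", "petal width (cm)")
print(filled)
```

Returning a new DataFrame rather than modifying the input keeps the original data available for comparison. The mask also skips rows where the width itself is missing, since the model cannot predict from a NaN input.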
