
Linear regression analysis with categorical feature

Regression algorithms work fine on data represented as numbers. It's quite clear how to run a regression on numeric data and predict the output. However, I need to do regression analysis on data that contains a categorical feature. I have a CSV file with two columns, install-id and page-name, both of object type. I need to give install-id as input, and page-name should be predicted as output. Below is my code. Please help me with this.

import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# load the data
data = pd.read_csv("/Users/kashifjilani/Downloads/csv/newjsoncontent.csv")
X = data["install-id"]
Y = data["endPoint"]

# one-hot encode the categorical input
X = pd.get_dummies(data=X, drop_first=True)

# split, fit, and predict
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)

For the demonstration, let's say you have this dataframe, where IQ and Gender are the input features and the target variable is Test Score.

|   Student |   IQ | Gender   |   Test Score |
|----------:|-----:|:---------|-------------:|
|         1 |  125 | Male     |           93 |
|         2 |  120 | Female   |           86 |
|         3 |  115 | Male     |           96 |
|         4 |  110 | Female   |           81 |
|         5 |  105 | Male     |           92 |
|         6 |  100 | Female   |           75 |
|         7 |   95 | Male     |           84 |
|         8 |   90 | Female   |           77 |
|         9 |   85 | Male     |           73 |
|        10 |   80 | Female   |           74 |

Here, IQ is numerical and Gender is a categorical feature. In the preprocessing step, we'll apply a SimpleImputer to the numerical feature and a OneHotEncoder to the categorical feature. You can use sklearn's Pipeline and ColumnTransformer for that. Then you can use your model of choice to train and predict easily.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn import linear_model

# defining the data
d = {
    "Student": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "IQ": [125, 120, 115, 110, 105, 100, 95, 90, 85, 80,],
    "Gender": [
        "Male",
        "Female",
        "Male",
        "Female",
        "Male",
        "Female",
        "Male",
        "Female",
        "Male",
        "Female",
    ],
    "Test Score": [93, 86, 96, 81, 92, 75, 84, 77, 73, 74],
}

# converting into pandas dataframe
df = pd.DataFrame(d)

# setting the student id as index to keep track
df = df.set_index("Student")

# column transformation
categorical_columns = ["Gender"]
numerical_columns = ["IQ"]

# determine X
X = df[categorical_columns + numerical_columns]
y = df["Test Score"]

# train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.3
)

# categorical pipeline
categorical_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

# numerical pipeline
numerical_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),])

# aggregating both the pipeline
preprocessing = ColumnTransformer(
    [
        ("cat", categorical_pipe, categorical_columns),
        ("num", numerical_pipe, numerical_columns),
    ]
)


model = Pipeline(
    [("preprocess", preprocessing), ("regressor", linear_model.LinearRegression())]
)

# train
model.fit(X_train, y_train)

# predict
predicted = model.predict(X_test)

This outputs:

>> array([84.48275862, 84.55172414, 79.13793103])
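As a quick sanity check (my own addition, not part of the original answer), you can also score the fitted pipeline on the held-out split. Pipeline.score delegates to the final estimator, which for LinearRegression is the R² coefficient of determination; with only three test rows the number is not very meaningful, but the call is the same on real data:

# evaluate the fitted pipeline on the test split (R^2 for LinearRegression)
r2 = model.score(X_test, y_test)
print(f"R^2 on the test set: {r2:.3f}")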

I think here we have to remember the assumptions of a regression model, since we are trying to predict/identify a trend between the independent variables (X) and the dependent variable (y):

- Linearity
- Independent variables with limited multicollinearity
- Homoscedasticity

As given in your example, you have only one independent variable, and to summarize, the trend between X and y should be linear. A quick way to eyeball these assumptions on the demo data above is sketched below.
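As an illustration (my own sketch, reusing df, model, X_test and y_test from the answer above), you can look at pairwise correlations for multicollinearity and at the residuals for homoscedasticity:

# multicollinearity: pairwise correlations between numeric predictors
# (trivial with a single numeric feature, but the same call scales up)
print(df[["IQ"]].corr())

# homoscedasticity: residuals should show no pattern against fitted values
fitted = model.predict(X_test)
residuals = y_test - fitted
print(pd.DataFrame({"fitted": fitted, "residual": residuals}))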

Suppose, for example, you are given the task of predicting the total travel time of a trip, and your dataset has the following variables: IV (independent variables) - Miles Traveled, NoOfDeliveries, GasPrice and City; DV (dependent variable) - Traveltime.

Here you can see it is a mixture of numerical (Miles Traveled, GasPrice) and categorical variables (NoOfDeliveries, City). Now you have to encode these categorical variables as numbers (in order to work with regression analysis) and predict the output.

To encode the categorical variables into binary format, we use two objects from the sklearn library here - LabelEncoder and OneHotEncoder.
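Here is a minimal sketch of both encoders (my own illustration; the City values are invented for the travel-time example above):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# hypothetical City column from the travel-time example
city = np.array(["Boston", "Chicago", "Boston", "Denver"]).reshape(-1, 1)

# LabelEncoder maps each category to an integer label
labels = LabelEncoder().fit_transform(city.ravel())
print(labels)  # e.g. [0 1 0 2]

# OneHotEncoder expands the column into one binary column per category
onehot = OneHotEncoder().fit_transform(city).toarray()
print(onehot)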

Please follow the link below to learn more about how to deal with categorical variables.

Please find the link below to learn more about the dummy variable trap (a short illustration follows these links).

Please find the link below to learn more about building a simple linear regression model.
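To illustrate the dummy variable trap mentioned above (my own sketch): if you keep one dummy column per category, the columns always sum to 1 in every row, which makes them perfectly collinear with the intercept; dropping one level avoids this, and pandas exposes that via drop_first:

import pandas as pd

gender = pd.Series(["Male", "Female", "Male"], name="Gender")

# all dummies: the columns sum to 1 in every row -> perfect collinearity
print(pd.get_dummies(gender))

# dropping the first level avoids the dummy variable trap
print(pd.get_dummies(gender, drop_first=True))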
