
Assign coefficients back to categorical variables in Multiple Linear Regression

From running a multiple linear regression using Scikit-learn, I need to obtain an equation like Y = a + bX1 + cX2 + dX3 + eX4 + fX5, where b, c, d, e and f are the coefficients of the independent variables.

I have performed multiple linear regression using Scikit-learn with 3 categorical variables (Cat V) and 2 continuous variables (Cont V), as below:

    Cat V 1    Cat V 2    Cat V 3    Cont V 1    Cont V 2
    A          C3         X2         208         3000
    B          C6         X4         256         4000
    B          C7         X5         275         2000
    C          C2         X1         508         3200

I have encoded the categorical data using a column transformer, which has resulted in many more columns, as each categorical variable has more than 10 different categories. The code I used to do this is below:

    # Encoding categorical data
    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    mct = make_column_transformer((OneHotEncoder(drop='first'), [0, 1, 2]),
                                  remainder='passthrough')
    X = mct.fit_transform(X)

    # Splitting the dataset into the Training set and Test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Fitting Multiple Linear Regression to the Training set
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

I have found the coefficients of each variable (after encoding) using the .coef_ attribute, with the following code:

    print(regressor.coef_)

The problem is that this shows the coefficients of the variables after being split by the encoding, like below:

    [ 1.80198679e-05 -5.55304459e-05  1.90462615e-03 -6.22320276e-05
      1.17184589e-03  ...            -2.33744077e-03 -1.91538011e-04
      8.61626216e-11  3.73358813e-03]

I need to find the 5 coefficients of the original 5 variables, like this:

    Cat V 1     Coefficient 1
    Cat V 2     Coefficient 2
    Cat V 3     Coefficient 3
    Cont V 1    Coefficient 4
    Cont V 2    Coefficient 5

Is it possible to do this?

Linear regression means you're searching for f in y=f(x), or y=f(x1,x2,...), for continuous variables. The machinery does not work for categories: it assumes that a variable corresponding to a category can vary smoothly between C2 and C3, C3 and C4, and so on. When you created several columns, things may have got worse: now you have even more variables trying to accommodate the shape of f() — see what I mean? Think of a single column of categories, y=f(c); now you have y=f(c1,c2,...), each varying continuously and, this way, mixing categories together in small amounts (your coefficients, on the order of 10^-5, 10^-6, etc.).

Logistic regression employs an f() with a curious shape (the sigmoid), with extreme values 0 and 1 and a ramp in between; it is continuous between Cx and Cy but has a sudden jump, which is why it is often associated with this type of problem. Neural networks such as the multi-layer perceptron are nothing but regression decorated with fancy names like AI, neural, etc. Does that solve your problem? It depends, period. But dozens of papers have been published by running such a regression, tweaking parameters and 'learning' algorithms, and tagging the whole thing with hot topic words.
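The sigmoid shape described above can be sketched in a few lines (a minimal illustration, not part of the original answer): values saturate at 0 and 1 with a ramp in between, which is what makes it behave more like a class boundary than a straight line.

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5, the middle of the ramp
print(sigmoid(10))   # very close to 1
```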

If — and only if — there is some logic to the idea of transitioning from one category to another (suppose an object might be in an intermediate state), you might code your categories as numbers: maybe C1=1, C2=2, etc. In the end, the continuous values might indicate a variable approximately matching a category — or none of that: the variable may simply have been distorted enough to make f() best fit the outputs y1, y2, ... you provided. See, there is no definitive answer here. Any way you do it is approximate.
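The "code categories as numbers" idea can be sketched with scikit-learn's OrdinalEncoder. The ordering C1 < C2 < C3 below is a hypothetical assumption; this encoding only makes sense when an intermediate state between neighbouring categories is meaningful.

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit category order: C1 -> 0, C2 -> 1, C3 -> 2 (assumed ordering)
enc = OrdinalEncoder(categories=[["C1", "C2", "C3"]])
codes = enc.fit_transform([["C2"], ["C1"], ["C3"]])
print(codes.ravel())  # each category replaced by its position in the order
```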

Instead of using linear regression you might try to fit another curve (e.g. parabolic, sine, ...), but that brings a bunch of new problems. The MLP (perceptron) is a summation of sigmoids and has nice approximation capabilities (compared with a parabola, sine, ...), hence the interest in it.

Then there's the SVM (Support Vector Machine), another beast on the scene; same basic idea, but you work with something like y=f(g(x)) for some crazy g() which makes it easy to find f().

Another option: things like decision tree learning and case-based reasoning; these can be performed with tools such as RapidMiner with the Weka plugin, or Weka itself.

Simple linear regression is a complicated problem — not because of the math (which can be presented in horrible ways), but because of the subtleties around the data and how it represents something in the real world. And you have something more difficult than simple linear regression (sorry for the bad news). Hope you find an acceptable solution.
