Which algorithm to use for percentage features in my DV and IV, in regression?
I am using regression to analyze server data to find feature importance.
Some of my IVs (independent variables), or Xs, are percentages, like % of time, % of cores, % of resource used, while others are plain numbers, like number of bytes.
I standardized all my Xs with (X-X_mean)/X_stddev. (Am I wrong in doing so?)
Which algorithm should I use in Python if my IVs are a mix of numeric values and percentages and I predict Y in the following cases:
Case 1: Predict a continuous-valued Y
a. Will using a Lasso regression suffice?
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
Case 2: Predict a percentage-valued Y, like "% resource used".
a. Should I use Beta-Regression? If so, which package in Python offers this?
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
If I am wrong in standardizing the Xs which are already percentages, is it fine to use these numbers as 0.30 for 30%, so that they fall within the range 0-1? That would mean I do not standardize them, but I will still standardize the other numeric IVs.
Final aim for both Cases 1 and 2:
To find the % of impact of IVs on Y, e.g.: when X1 increases by 1 unit, Y increases by 21%.
I understand from other posts that we can NEVER add up all coefficients to a total of 100 to assess the % of impact of each and every IV on the DV. I hope I am correct in this regard.
Having a mix of predictors doesn't matter for any form of regression; it only changes how you interpret the coefficients. What does matter, however, is the type/distribution of your Y variable.
Case 1: Predict a continuous-valued Y
a. Will using a Lasso regression suffice?
Regular OLS regression will work fine for this.
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
The interpretation of coefficients always follows a format like "for a 1 unit change in X, we expect an X-coefficient amount of change in Y, holding the other predictors constant".
Because you have standardized X, your unit is a standard deviation. So the interpretation will be "for a 1 standard deviation change in X, we expect an X-coefficient amount of change in Y..."
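As a rough illustration of that reading, the sketch below fits ordinary least squares twice on made-up data, once on the raw predictors and once on the z-scored ones. The variable names (`bytes_sent`, `cpu_pct`), the synthetic data and the helper `fit_ols` are assumptions for the example, not anything from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical predictors: a raw count and a percentage (0-100).
bytes_sent = rng.normal(5000.0, 1200.0, n)
cpu_pct = rng.uniform(5.0, 95.0, n)
X_raw = np.column_stack([bytes_sent, cpu_pct])

# Made-up continuous outcome with some noise.
y = 2.0 + 0.004 * bytes_sent + 0.3 * cpu_pct + rng.normal(0.0, 1.0, n)


def fit_ols(X, y):
    """Return [intercept, slope_1, ..., slope_k] from least squares."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# z-score each column: (X - X_mean) / X_stddev
X_mean = X_raw.mean(axis=0)
X_std = X_raw.std(axis=0)
X_z = (X_raw - X_mean) / X_std

coef_raw = fit_ols(X_raw, y)
coef_z = fit_ols(X_z, y)

# Slope on the standardized scale = raw slope * that column's std dev,
# i.e. the expected change in Y for a 1 standard deviation change in X.
print("raw slopes:         ", coef_raw[1:])
print("standardized slopes:", coef_z[1:])
print("raw slopes * std:   ", coef_raw[1:] * X_std)
```

The last two printed rows should agree: standardizing does not change the fit, it only changes the unit in which each slope is expressed.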
c. How do I interpret the X-coefficient if X is standardized and is a %?
Same as above. Your units are still standard deviations, even though the variable originally came from a percentage.
Case 2: Predict a percentage-valued Y, like % resource used.
a. Should I use Beta-Regression? If so, which package in Python offers this?
This is tricky. The typical recommendation is to use something like binomial logistic regression when your Y outcome is a percentage.
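To make that recommendation concrete, here is a minimal sketch of a logit-link binomial fit to a proportion-valued Y, solved with iteratively reweighted least squares in plain NumPy. The function name `fit_fractional_logit`, the synthetic data and the coefficient values are all assumptions for illustration; in practice you would more likely reach for a library GLM (for example, statsmodels' `GLM` with a `Binomial` family) than hand-roll this.

```python
import numpy as np


def fit_fractional_logit(X, y, n_iter=50, tol=1e-10):
    """Fit a logit-link binomial model to proportions y in (0, 1) by IRLS.

    Returns [intercept, slope_1, ..., slope_k] on the log-odds scale.
    """
    design = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        eta = design @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # predicted proportion
        w = mu * (1.0 - mu)               # IRLS weights
        z = eta + (y - mu) / w            # working response
        new_beta = np.linalg.solve(design.T @ (w[:, None] * design),
                                   design.T @ (w * z))
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 2))               # hypothetical standardized predictors
true_beta = np.array([-0.5, 0.8, -0.3])   # made-up coefficients (log odds)
eta = true_beta[0] + X @ true_beta[1:]
y = 1.0 / (1.0 + np.exp(-eta))            # noise-free "% used", stored as 0-1

beta_hat = fit_fractional_logit(X, y)
print(beta_hat)
```

Because the toy Y is generated without noise, the fit should recover the made-up coefficients; with real data Y must be expressed as a fraction between 0 and 1 (30% as 0.30) before fitting.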
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
Same as the interpretations above. But if you use logistic regression, the coefficients are in units of log odds. I would recommend reading up on logistic regression to get a deeper sense of how this works.
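For a feel of what "units of log odds" means, the short sketch below converts a coefficient into an odds ratio and into a change in Y from a given starting point. The coefficient (0.5) and the baseline (30%) are made-up numbers, and `logit` / `inv_logit` are helper names chosen for this example.

```python
import math


def logit(p):
    """Proportion in (0, 1) -> log odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Log odds -> proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


coef = 0.5        # hypothetical coefficient: log-odds change per 1 SD of X
baseline = 0.30   # hypothetical starting value of Y (30% resource used)

odds_ratio = math.exp(coef)
new_value = inv_logit(logit(baseline) + coef)

print(f"odds multiply by {odds_ratio:.3f}")
print(f"Y moves from {baseline:.2f} to {new_value:.3f}")
```

The same coefficient gives a different change in percentage points at a different baseline, so a log-odds coefficient cannot be read as one fixed "% impact" on Y.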
If I am wrong in standardizing the Xs which are already percentages, is it fine to use these numbers as 0.30 for 30%, so that they fall within the range 0-1? That would mean I do not standardize them, but I will still standardize the other numeric IVs.
Standardizing is perfectly fine for variables in regression, but like I said, it changes your interpretation, as your unit is now a standard deviation.
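A small sketch of the point about units, assuming a single made-up percentage predictor (`cpu_pct`) and a hypothetical helper `fit_line`: the same variable is fed to least squares as 0-100, as 0-1, and z-scored.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

cpu_pct = rng.uniform(0.0, 100.0, n)      # hypothetical % predictor, 0-100
y = 10.0 + 0.25 * cpu_pct + rng.normal(0.0, 2.0, n)


def fit_line(x, y):
    """Least-squares [intercept, slope] and fitted values for one predictor."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef


coef_pct, fitted_pct = fit_line(cpu_pct, y)             # 30 means 30%
coef_frac, fitted_frac = fit_line(cpu_pct / 100.0, y)   # 0.30 means 30%
z = (cpu_pct - cpu_pct.mean()) / cpu_pct.std()
coef_z, fitted_z = fit_line(z, y)                       # z-scored

print("slope per percentage point: ", coef_pct[1])
print("slope per full 0-to-1 range:", coef_frac[1])
print("slope per standard deviation:", coef_z[1])
```

All three versions produce the same fitted values; only the size of the slope, and therefore the unit you quote when interpreting it, changes.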
Final aim for both Cases 1 and 2:
To find the % of impact of IVs on Y, e.g.: when X1 increases by 1 unit, Y increases by 21%.
If your Y is a percentage and you use something like OLS regression, then that is exactly how you would interpret the coefficients (for a 1 unit change in X1, Y changes by the coefficient's value in percentage points, holding the other predictors constant).
Your question confuses some concepts and jumbles a lot of terminology. Essentially you're asking about a) feature preprocessing for (linear) regression, b) the interpretability of linear regression coefficients, and c) sensitivity analysis (the effect of feature X_i on Y). But be careful, because you're making a huge assumption that Y is linearly dependent on each X_i; see below.
A note on terminology: (X-X_mean)/X_stddev is z-score scaling. Most references (and scikit-learn's StandardScaler) call this standardization, although usage is not uniform and some sources call it normalization. The other common choice is min-max scaling, (X-X_min)/(X_max-X_min), which transforms each variable into the range [0,1].
If Y does not depend linearly on some X_i, regression models often include transformed terms such as sqrt(X), log(X), log1p(X), exp(X), etc.: anything that best captures the nonlinear relationship. You may also see variable-variable interaction terms, although linear regression assumes the predictors are not perfectly collinear, and strongly correlated predictors make individual coefficients harder to interpret.
(Your question would get better answers over at CrossValidated, but it's fine to leave it here on SO; there is a crossover.)
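The scaling formulas and the transformed and interaction terms mentioned above could be put together roughly as follows. This is only a sketch: the variables `bytes_sent` and `cpu_frac`, the synthetic data, and the choice of `log1p` for the byte count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

bytes_sent = rng.uniform(1e3, 1e6, n)     # hypothetical raw count
cpu_frac = rng.uniform(0.05, 0.95, n)     # hypothetical percentage stored as 0-1


def min_max(x):
    """(X - X_min) / (X_max - X_min): rescale to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


def z_score(x):
    """(X - X_mean) / X_stddev: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()


log_bytes = np.log1p(bytes_sent)          # log1p(X) compresses large values

# Design matrix: intercept, transformed terms, and one interaction term.
design = np.column_stack([
    np.ones(n),
    z_score(log_bytes),
    z_score(cpu_frac),
    z_score(log_bytes) * z_score(cpu_frac),   # variable-variable interaction
])

print(design.shape)
print(min_max(bytes_sent).min(), min_max(bytes_sent).max())
```

Which transform to use is an empirical question; plotting Y against each X_i before and after transforming is a reasonable way to check whether the relationship looks closer to linear.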