
Which algorithm to use for percentage features in my DV and IV, in regression?

Original post: 2019-07-08 10:31:52. Tags: python, statistics, regression, percentage, feature-extraction

I am using regression to analyze server data to find feature importance.

Some of my IVs (independent variables), or Xs, are percentages, such as % of time, % of cores, or % of resource used, while others are plain numbers, such as number of bytes.

I standardized all my Xs with (X-X_mean)/X_stddev. (Am I wrong in doing so?)

Which algorithm should I use in Python when my IVs are a mix of numeric values and percentages and I want to predict Y in the following cases:

Case 1: Predict a continuous-valued Y

a. Will using a Lasso regression suffice?

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

c. How do I interpret the X-coefficient if X is standardized and is a %?

Case 2: Predict a percentage-valued Y, like "% resource used".

a. Should I use Beta-Regression? If so, which package in Python offers this?

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

c. How do I interpret the X-coefficient if X is standardized and is a %?

If I am wrong in standardizing the Xs which are already percentages, is it fine to use these numbers as fractions (0.30 for 30%) so that they fall within the range 0-1? That would mean I do not standardize them, but I would still standardize the other numeric IVs.

Final Aim for both Cases 1 and 2:

To find the % of impact of IVs on Y. E.g.: when X1 increases by 1 unit, Y increases by 21%.

I understand from other posts that we can NEVER add up all coefficients to a total of 100 to assess the % of impact of each and every IV on the DV. I hope I am correct in this regard.

2 Answers

Having a mix of predictor types doesn't matter for any form of regression; it only changes how you interpret the coefficients. What does matter, however, is the type/distribution of your Y variable.

Case 1: Predict a continuous valued Y a. Will using a Lasso regression suffice?

Regular OLS regression will work fine for this.

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

The interpretation of coefficients always follows a format like "for a 1 unit change in X, we expect an X-coefficient amount of change in Y, holding the other predictors constant".

Because you have standardized X, your unit is a standard deviation. So the interpretation will be "for a 1 standard deviation change in X, we expect an X-coefficient amount of change in Y..."

c. How do I interpret the X-coefficient if X is standardized and is a %?

Same as above. Your units are still standard deviations, even though the variable originally came from a percentage.
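
As a minimal sketch of the standard-deviation interpretation above: the feature names (cpu_pct, bytes_sent), the simulated data, and the small ols helper are all made up for illustration and are not taken from the question. It fits the same OLS model on raw and on standardized Xs and shows that the standardized coefficient is just the raw coefficient multiplied by that feature's standard deviation, whether the feature is a percentage or a count.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
cpu_pct = rng.uniform(0, 100, n)         # hypothetical percentage feature (0-100)
bytes_sent = rng.normal(5e6, 1e6, n)     # hypothetical count-like feature
y = 0.3 * cpu_pct + 2e-6 * bytes_sent + rng.normal(0, 1, n)

X_raw = np.column_stack([cpu_pct, bytes_sent])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # (X - X_mean) / X_stddev

def ols(X, y):
    """Least-squares fit with an intercept; returns the slopes only."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

b_raw = ols(X_raw, y)   # change in y per 1 raw unit (1 percentage point, 1 byte)
b_std = ols(X_std, y)   # change in y per 1 standard deviation of each feature

# Same model, different units: b_std equals b_raw * std(X), column by column
print(b_std)
print(b_raw * X_raw.std(axis=0))
```

The raw coefficients are hard to compare because one is per percentage point and the other per byte; the standardized ones share a unit (one standard deviation), which is the practical reason to standardize before comparing magnitudes.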

Case 2: Predict a percentage-valued Y, like % resource used.

a. Should I use Beta-Regression? If so, which package in Python offers this?

This is tricky. The typical recommendation is to use something like binomial logistic regression when your Y outcome is a percentage, with Y expressed as a fraction between 0 and 1.

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

c. How do I interpret the X-coefficient if X is standardized and is a %?

Same as the interpretations above. But if you use logistic regression, the coefficients are in units of log odds. I would recommend reading up on logistic regression to get a deeper sense of how this works.
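
To make the log-odds units concrete, here is a rough sketch of a logistic-type fit to a fractional Y. It is a hand-rolled Newton-Raphson loop in NumPy, written only to stay self-contained; the function name fit_fractional_logit, the simulated data, and the coefficient values are assumptions for illustration. In practice one would more likely reach for a packaged GLM with a binomial family (statsmodels provides one).

```python
import numpy as np

def fit_fractional_logit(X, y, n_iter=25):
    """Newton-Raphson fit of E[y] = sigmoid(X @ beta) for y in [0, 1]."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # current fitted fractions
        grad = X.T @ (y - p)                           # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])    # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)                         # one standardized predictor
X = np.column_stack([np.ones(n), x])           # intercept + predictor
true_beta = np.array([-0.5, 0.8])              # made-up coefficients
y = 1.0 / (1.0 + np.exp(-X @ true_beta))       # noise-free "fraction of resource used"

beta = fit_fractional_logit(X, y)
odds_ratio = np.exp(beta[1])   # factor by which the odds y/(1-y) change per 1 SD of x
print(beta, odds_ratio)
```

Here beta[1] reads as "a 1 standard deviation increase in x adds beta[1] to the log odds of Y", and exponentiating it gives the corresponding odds ratio. The effect on Y itself in percentage points is not constant; it depends on where on the curve you start.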

If I am wrong in standardizing the Xs which are a % already, is it fine to use these numbers as 0.30 for 30% so that they fall within the range 0-1? So that means I do not standardize them, I will still standardize the other numeric IVs.

Standardizing is perfectly fine for variables in regression, but like I said, it changes your interpretation, because your unit is now a standard deviation.

Final Aim for both cases 1 & 2:

To find the % of impact of IVs on Y. E.g.: when X1 increases by 1 unit, Y increases by 21%.

If your Y is a percentage and you use something like OLS regression, then that is exactly how you would interpret the coefficients: for a 1 unit change in X1, Y changes by that many percentage points (and if X1 is standardized, "1 unit" means 1 standard deviation).

Your question confuses some concepts and jumbles a lot of terminology. Essentially you're asking about a) feature preprocessing for (linear) regression, b) the interpretability of linear regression coefficients, and c) sensitivity analysis (the effect of feature X_i on Y). But be careful, because you're making a huge assumption that Y is linearly dependent on each X_i; see below.

Standardization is not an "algorithm", just a technique for preprocessing data.
Standardization matters for regression when the fit depends on feature scale (regularized models such as Lasso or Ridge) or when you want to compare coefficient magnitudes; for plain OLS it only rescales the coefficients. It is not needed for tree-based algorithms (RF/XGB/GBT): with those, you can feed in raw numeric features directly (percents, totals, whatever).
(X-X_mean)/X_stddev is standardization (z-scoring): each variable ends up with mean 0 and standard deviation 1.
- (An alternative is min-max normalization: (X-X_min)/(X_max-X_min), which transforms each variable into the range [0,1]; with an extra shift and rescale you can map to another fixed range such as [-1,1].)
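
The rescalings discussed here can be compared side by side in a short sketch. The cpu_pct values are hypothetical. In scikit-learn the first two correspond to StandardScaler and MinMaxScaler; the third is the divide-by-100 option raised in the question.

```python
import numpy as np

cpu_pct = np.array([10.0, 30.0, 55.0, 80.0, 95.0])   # hypothetical % values

# (X - X_mean) / X_stddev: mean 0, standard deviation 1
z_score = (cpu_pct - cpu_pct.mean()) / cpu_pct.std()

# (X - X_min) / (X_max - X_min): smallest value maps to 0, largest to 1
min_max = (cpu_pct - cpu_pct.min()) / (cpu_pct.max() - cpu_pct.min())

# Plain unit change: 30% becomes 0.30; the shape of the data is untouched
as_fraction = cpu_pct / 100.0

print(z_score)
print(min_max)
print(as_fraction)
```

Note that dividing by 100 is not the same as min-max scaling unless the observed values actually span 0 to 100. It only changes the unit, so a regression coefficient then reads "per 100 percentage points" instead of "per percentage point".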
Last, you ask about sensitivity analysis in regression: can we directly interpret the regression coefficient for X_i as the sensitivity of Y on X_i?
- Stop and think about your underlying linearity assumption in "Final Aim for both cases 1 & 2: To find the % of impact of IVs on Y. Eg: When X1 increases by 1 unit, Y increases by 21%".
- You're assuming that the dependent variable has a linear relationship with each independent variable. But that is often not the case; it may be nonlinear. For example, if you're looking at the effect of Age on Salary, you would typically see it increase up to the 40s/50s, then decrease gradually, and when you hit retirement age (say 65), decrease sharply.
- So you would model the effect of Age on Salary as a quadratic or higher-order polynomial by throwing in Age^2 and maybe Age^3 terms (or sometimes sqrt(X), log(X), log1p(X), exp(X), etc. terms; anything that best captures the nonlinear relationship. You may also see variable-variable interaction terms. Regression does not require predictors to be uncorrelated, but strongly correlated predictors (multicollinearity) make individual coefficients unstable and hard to interpret.)
- Obviously, Age has a huge effect on Salary, but we would not measure the sensitivity of Salary to Age by combining the (absolute values of the) coefficients of Age, Age^2, Age^3.
- If we only had a linear term for Age, the single coefficient for Age would massively understate the influence of Age on Salary: it would net out ("average out") the strong positive relationship for the regime Age<40 against the negative relationship for Age>50.
So the general answer to "Can we directly interpret the regression coefficient for X_i as the sensitivity of Y on X_i?" is "Only if the relationship between Y and that X_i is linear, otherwise no".
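
The Age/Salary argument can be illustrated with a small sketch. The inverted-U salary curve and every number in it are invented for the purpose, not real data. A linear-only fit returns a slope near zero, while a quadratic fit recovers a sensitivity that is strongly positive at 30 and strongly negative at 60.

```python
import numpy as np

age = np.linspace(20, 70, 101)
# Hypothetical inverted-U salary curve peaking at age 45 (made-up numbers)
salary = 80_000 - 40.0 * (age - 45.0) ** 2

# Linear-only model: salary ~ b0 + b1 * age
b1_linear = np.polyfit(age, salary, 1)[0]

# Quadratic model: salary ~ c0 + c1 * age + c2 * age^2
c2, c1, c0 = np.polyfit(age, salary, 2)

# Local sensitivity d(salary)/d(age) = c1 + 2 * c2 * age depends on age
slope_at_30 = c1 + 2 * c2 * 30
slope_at_60 = c1 + 2 * c2 * 60

print(b1_linear, slope_at_30, slope_at_60)
```

The single linear coefficient says Age barely matters, even though Salary swings by tens of thousands across the range. With the quadratic terms in place, no single coefficient is "the" sensitivity either; it has to be evaluated at a particular Age.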
In general, a better and easier way to do sensitivity analysis (without assuming a linear response or needing to standardize % features) is to use tree-based algorithms (RF/XGB/GBT), which generate feature importances.
- As an aside, I understand your exercise tells you to use regression, but in general you get better, faster feature-importance information from tree-based models (RF/XGB), especially with shallow trees (a small value for max_depth, a large value of nodesize, e.g. >0.1% of training-set size). That's why people use them, even when their final goal is regression.
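
A rough sketch of the tree-based route, assuming scikit-learn is installed. The data is simulated (the same invented Age/Salary curve plus an irrelevant percentage feature called noise_pct), and the hyperparameter values are arbitrary choices for illustration. min_samples_leaf is used here as scikit-learn's closest counterpart to "nodesize".

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(20, 70, n)
noise_pct = rng.uniform(0, 100, n)     # irrelevant percentage feature
salary = 80_000 - 40.0 * (age - 45.0) ** 2 + rng.normal(0, 500, n)

X = np.column_stack([age, noise_pct])  # raw features, no standardization
model = RandomForestRegressor(
    n_estimators=100,
    max_depth=4,            # shallow trees
    min_samples_leaf=10,    # minimum samples per leaf (role of "nodesize")
    random_state=0,
)
model.fit(X, salary)

importances = model.feature_importances_   # non-negative, sums to 1
print(dict(zip(["age", "noise_pct"], importances)))
```

Almost all of the importance should land on age, despite the relationship being nonlinear and the features being on different raw scales. Keep in mind these importances rank features; they say nothing about the direction or size of the effect, so they complement regression coefficients rather than replace them.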

(Your question would get better answers over at CrossValidated, but it's fine to leave it here on SO; there is some crossover.)