How can I correct linear regression for a large (unknown) constant added to all of my input features?

Suppose I have an input feature vector containing 10 input features, each with order of magnitude around 1E-7 .

When I run linear regression with the log of these input features, I get an R^2 value of around 0.98 .

However, if I add 1E-2 to each of my input features before running through the above fit, I get an R^2 value of 0.5616 .

The problem is that I will not know a priori that the constant that was added to my input features was 1E-2 , so I cannot simply subtract off that quantity every time.

Is there a general way to correct for a large, unknown constant added to my input feature set?

Here is a sample input file: http://stanford.edu/~hq6/13

Here is a corresponding output file: http://stanford.edu/~hq6/15

Here is some code that is used for training:

input_features = read.csv('InputFeatures.csv', header=F)

# Adding constant error term to all input features
input_features = input_features + 1E-2
# How can we correct for this constant if we do not know what the constant is beforehand?

input_features[input_features <= 0] = 1E-10
input_features = log(input_features)
output = read.csv('Output.csv', header=F)

full_data = data.frame(input_features,  output)
summary(lm(V1.1 ~ ., data=full_data))

When this code is run without the line input_features = input_features + 1E-2 , I get an R-squared of approximately 0.98 from the summary output.

When this line is put in, the R-squared drops below 0.5.

So you're suggesting your dataset fits y = A + B*exp(C*x) . Why not do a direct fit using nls or other nonlinear fitting tools?
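A minimal sketch of what such a direct fit could look like with nls, using simulated data rather than the files from the question, and a single feature instead of ten. Here the unknown offset is treated as an extra parameter d that nls estimates along with the slope and intercept; the starting values and the "port" bounds (which keep x_obs - d positive) are assumptions, and convergence is sensitive to them:

```r
set.seed(1)

# Simulate the situation in the question: the true relationship is
# linear in log(x), but we observe x shifted by an unknown constant.
x_true <- runif(200, min = 1e-8, max = 1e-6)
y      <- 2 + 3 * log(x_true) + rnorm(200, sd = 0.05)
x_obs  <- x_true + 1e-2          # unknown offset we would like to recover

# Fit y = a + b * log(x_obs - d), estimating the offset d directly.
# The upper bound on d keeps the log argument positive.
fit <- nls(y ~ a + b * log(x_obs - d),
           start     = list(a = 0, b = 1, d = 0.999 * min(x_obs)),
           algorithm = "port",
           lower     = c(a = -Inf, b = -Inf, d = 0),
           upper     = c(a = Inf,  b = Inf,  d = min(x_obs) - 1e-12))

coef(fit)   # inspect a, b, and the fitted offset d
```

With ten features the same idea applies, but each feature would need its log term inside the formula, and good starting values become even more important.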

If you wish to do a linear fit to the log of both sides, the rules of logarithms (e.g. log(ab) = log(a) + log(b)) make it clear that there is no corresponding rule for log(a + b): you cannot separate out the effect of two summed terms. In particular, when a constant c dominates x, log(x + c) ≈ log(c) + x/c, which is nearly constant, so the linear-in-log fit degrades.

Linear regression on R^10 produces 11 real numbers: the 10 coefficients of the hyperplane plus an intercept. From your post it seems that you report only one number ("value of ...") or at most two ("R^2"), which still seems wrong.

Or maybe by R^2 you meant residuals error?

Linear regression itself is invariant to adding a constant, as long as it does not introduce numerical imprecision and you add the constant to all of your features. If you add it to just one feature, it is quite obvious that the results will change, since that dimension may become more or less important (depending on the sign of the constant). To make the fit invariant to such operations, you can normalize your data (by linearly scaling each feature to the interval [0, 1], or by standardizing to mean = 0 and std = 1).
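A minimal sketch of the standardization described above, using base R's scale() (the file name is the one from the question; this assumes the same InputFeatures.csv is available):

```r
input_features <- read.csv('InputFeatures.csv', header = FALSE)

# scale() standardizes each column: subtract its mean, divide by its sd,
# so every feature ends up with mean 0 and standard deviation 1.
scaled_features <- as.data.frame(scale(input_features))

colMeans(scaled_features)        # approximately 0 for every column
apply(scaled_features, 2, sd)    # 1 for every column
```

Note that this makes the fit invariant to per-feature shifts and rescalings of the raw features; it does not, by itself, undo the damage an additive constant does to a subsequent log transform.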
