Filling NA using linear regression in R

I have a dataset with one time column and two variables (example below).

df <- structure(list(time = c(15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 
                              25, 26), var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 
                              120.3, NA, 142.5), var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85, 
                              130.5, 147.15, 163.8, 180.45, 197.1, 213.75)), .Names = c("time", 
                              "var1", "var2"), row.names = c(NA, -12L), class = c("tbl_df", 
                              "tbl", "data.frame"))

var1 has a few NA values, and I want to fill them using a linear regression between the remaining values of var1 and var2.

Please help! Let me know if you need more information.

Here is an example using lm to predict values in R.

library(dplyr)

# Construct linear model based on non-NA pairs
df2 <- df %>% filter(!is.na(var1))

fit <- lm(var1 ~ var2, data = df2)

# See the result
summary(fit)

# Call:
#   lm(formula = var1 ~ var2, data = df2)
# 
# Residuals:
#   1          2          3          4          5          6 
# 8.627e-15 -2.388e-15  1.546e-16 -9.658e-15 -2.322e-15  5.587e-15 
# 
# Coefficients:
#   Estimate Std. Error   t value Pr(>|t|)    
# (Intercept) 2.321e-14  5.619e-15 4.130e+00   0.0145 *  
#   var2        6.667e-01  4.411e-17 1.511e+16   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 7.246e-15 on 4 degrees of freedom
# Multiple R-squared:      1,   Adjusted R-squared:      1 
# F-statistic: 2.284e+32 on 1 and 4 DF,  p-value: < 2.2e-16
# 
# Warning message:
#   In summary.lm(fit) : essentially perfect fit: summary may be unreliable

# Use fit to predict the value
df3 <- df %>% 
  mutate(pred = predict(fit, .)) %>%
  # Replace NA with pred in var1
  mutate(var1 = ifelse(is.na(var1), pred, var1))

# See the result
df3 %>% as.data.frame()

#    time  var1   var2  pred
# 1    15  20.4  30.60  20.4
# 2    16  31.5  47.25  31.5
# 3    17  42.6  63.90  42.6
# 4    18  53.7  80.55  53.7
# 5    19  64.8  97.20  64.8
# 6    20  75.9 113.85  75.9
# 7    21  87.0 130.50  87.0
# 8    22  98.1 147.15  98.1
# 9    23 109.2 163.80 109.2
# 10   24 120.3 180.45 120.3
# 11   25 131.4 197.10 131.4
# 12   26 142.5 213.75 142.5
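
For readers who prefer to avoid dplyr, the same idea can be sketched in base R. This is a minimal sketch, not the answerer's code: the names is_missing and filled are made up for illustration, and the data is the question's data rebuilt as a plain data.frame. It relies on lm() dropping rows with NA by default, and it predicts only for the rows that actually need a value.

```r
# The question's data, rebuilt as a plain data.frame
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
           130.5, 147.15, 163.8, 180.45, 197.1, 213.75)
)

# With the default na.action (na.omit), lm() ignores rows where var1 is NA
fit <- lm(var1 ~ var2, data = df)

# Hypothetical helper names: flag the gaps, then fill them in a copy
is_missing <- is.na(df$var1)
filled <- df
filled$var1[is_missing] <- predict(fit, newdata = df[is_missing, , drop = FALSE])
```

Working on a copy keeps the original df untouched, so the observed and imputed values can still be told apart afterwards.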

Here is a one-liner using the approx function from base R:

# approx() returns a list with components x and y; $y holds the interpolated values
newvar1 <- approx(df$time, df$var1, xout = df$time)$y

This function applies a linear interpolation between neighboring points, as opposed to www's answer, which fits a single line across all of the points. With this data, both solutions give the same results because time and var1 have a perfect linear relationship, which may not always be the case.
The xout option specifies where to estimate the new values; in this case I am passing the original time vector.
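
To make the difference between the two approaches concrete, here is a small sketch on made-up data that is not linear. The toy data frame and the names local_fill and global_fill are assumptions for illustration only; they do not come from the question.

```r
# Hypothetical data: y follows x^2, with two interior gaps
toy <- data.frame(x = 1:6, y = c(1, 4, NA, 16, NA, 36))

# Piecewise-linear interpolation between the nearest observed neighbours.
# approx() drops the NA pairs before interpolating.
local_fill <- approx(toy$x, toy$y, xout = toy$x)$y

# One straight line fitted to all observed points at once
fit <- lm(y ~ x, data = toy)
global_fill <- ifelse(is.na(toy$y), predict(fit, newdata = toy), toy$y)

# Compare the two fills side by side
cbind(toy, local_fill, global_fill)
```

Here local_fill places the gap at x = 3 halfway between 4 and 16, while global_fill uses the regression line, so the two disagree. Note also that with its default rule = 1, approx does not extrapolate: an NA before the first or after the last observed point stays NA.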

Related: See the spline function for a cubic approximation.
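
A minimal sketch of the spline variant, again on made-up data (toy, obs and cubic_fill are illustrative names, not part of the original answer). The observed points are filtered explicitly before the call, so the sketch does not depend on how spline treats NA values.

```r
# Hypothetical data with two interior gaps
toy <- data.frame(x = 1:6, y = c(1, 4, NA, 16, NA, 36))

# Keep only the observed points for fitting
obs <- !is.na(toy$y)

# Cubic spline through the observed points, evaluated at every x
cubic_fill <- spline(toy$x[obs], toy$y[obs], xout = toy$x)$y
```

Like approx, spline returns a list, so the interpolated values are in the y component.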

I realize this is an old question, but this might be a useful brute-force technique.

Generate your linear model:

fit <- lm(var1 ~ var2, data = df)

Save the coefficients into an object using coef():

fit.c <- coef(fit)
fit.c

Use those coefficients to generate a predicted value as a new variable. The bracketed numbers indicate the position of the coefficient in the vector fit.c: fit.c[1] is the intercept and fit.c[2] is the slope for var2.

df$pred <- fit.c[1] + fit.c[2]*df$var2

At this point you can replace the NA values in the original variable:

# Index pred with the same mask so each NA gets the prediction for its own row
df$var1[is.na(df$var1)] <- df$pred[is.na(df$var1)]

However, my instinct is not to overwrite values in your original variable and instead to use pred for whatever purpose you had planned for var1.
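
One way to follow that advice is to leave var1 alone and store the filled values in a separate column, together with a flag marking which rows were imputed. This is only a sketch: the column names var1_imputed and var1_filled are made up, and the data is a shortened version of the question's data.

```r
# First six rows of the question's data
df <- data.frame(
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85)
)

fit <- lm(var1 ~ var2, data = df)
df$pred <- predict(fit, newdata = df)

# Hypothetical column names: a flag for imputed rows and a filled copy of var1
df$var1_imputed <- is.na(df$var1)
df$var1_filled <- ifelse(df$var1_imputed, df$pred, df$var1)
```

The original var1 still contains its NA values, so later analysis can include or exclude the imputed rows as needed.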
