[英]How to interpolate data with R?
I have a data frame that look like this:我有一个看起来像这样的数据框:
z <- data.frame(ent = c(1, 1, 1, 2, 2, 2, 3, 3, 3), year = c(1995, 2000, 2005, 1995, 2000, 2005, 1995, 2000, 2005), pobtot = c(50, 60, 70, 10, 4, 1, 100, 105, 110))
As you can see, there is a gap between 5 years for every "ent".如您所见,每个“ent”之间都有5年的差距。 I want to interpolate data to every missing year: 1996, 1997, 1998, 1999, 2001, 2002, 2003, 2004 and also prognosticate to 2006, 2007 and 2008. Is there a way to do this?
我想将数据插入到每个缺失的年份:1996、1997、1998、1999、2001、2002、2003、2004 并预测到 2006、2007 和 2008。有没有办法做到这一点?
Any help would be appreciated.任何帮助,将不胜感激。
We can use complete
to expand the data for each 'ent' and the 'year' range, then with na.approx
interpolate the missing values in 'pobtot'我们可以使用
complete
来扩展每个 'ent' 和 'year' 范围的数据,然后使用na.approx
插入 'pobtot' 中的缺失值
library(dplyr)
library(tidyr)
z %>%
complete(ent, year = 1995:2008) %>%
mutate(pobtot = zoo::na.approx(pobtot, na.rm = FALSE))
Assuming you want linear interpolation, R uses approx()
for such things by default, eg for drawing lines in a plot.假设你想要线性插值,R 默认使用
approx()
来处理这些事情,例如在 plot 中绘制线条。 We may also use that function to interpolate the years.我们也可以使用 function 来插入年份。 It doesn't extrapolate, though, but we could use
forecast::ets()
with default settings for this which calculates an exponential smoothing state space model.不过,它不能外推,但我们可以使用带有默认设置的
forecast::ets()
来计算指数平滑 state 空间 model。 Note, however, that this may also produce negative values, but OP hasn't stated what is needed in such a case.但是请注意,这也可能会产生负值,但 OP 并未说明在这种情况下需要什么。 So anyway in a
by()
approach we could do:所以无论如何,在
by()
方法中,我们可以这样做:
library(forecast)
p <- 3 ## define number of years for prediction
res <- do.call(rbind, by(z, z$ent, function(x) {
yseq <- min(x$year):(max(x$year) + p) ## sequence of years + piction
a <- approx(x$year, x$pobtot, head(yseq, -p))$y ## linear interpolation
f <- predict(ets(a), 3) ## predict `p` years
r <- c(a, f$mean) ## combine interpolation and prediction
data.frame(ent=x$ent[1], year=yseq, pobtot=r) ## output as data frame
}))
res
# ent year pobtot
# 1.1 1 1995 50.0
# 1.2 1 1996 52.0
# 1.3 1 1997 54.0
# 1.4 1 1998 56.0
# 1.5 1 1999 58.0
# 1.6 1 2000 60.0
# 1.7 1 2001 62.0
# 1.8 1 2002 64.0
# 1.9 1 2003 66.0
# 1.10 1 2004 68.0
# 1.11 1 2005 70.0
# 1.12 1 2006 72.0
# 1.13 1 2007 74.0
# 1.14 1 2008 76.0
# 2.1 2 1995 10.0
# 2.2 2 1996 8.8
# 2.3 2 1997 7.6
# 2.4 2 1998 6.4
# 2.5 2 1999 5.2
# 2.6 2 2000 4.0
# 2.7 2 2001 3.4
# 2.8 2 2002 2.8
# 2.9 2 2003 2.2
# 2.10 2 2004 1.6
# 2.11 2 2005 1.0
# 2.12 2 2006 0.4
# 2.13 2 2007 -0.2
# 2.14 2 2008 -0.8
# 3.1 3 1995 100.0
# 3.2 3 1996 101.0
# 3.3 3 1997 102.0
# 3.4 3 1998 103.0
# 3.5 3 1999 104.0
# 3.6 3 2000 105.0
# 3.7 3 2001 106.0
# 3.8 3 2002 107.0
# 3.9 3 2003 108.0
# 3.10 3 2004 109.0
# 3.11 3 2005 110.0
# 3.12 3 2006 111.0
# 3.13 3 2007 112.0
# 3.14 3 2008 113.0
We could quickly check this in a plot, which, apart from the negative values of entity 2 looks quite reasonable.我们可以在 plot 中快速检查这一点,除了实体 2 的负值之外,它看起来非常合理。
with(res, plot(year, pobtot, type='n', main='z'))
with(res[res$year < 2006, ], points(year, pobtot, pch=20, col=3))
with(res[res$year > 2005, ], points(year, pobtot, pch=20, col=4))
with(res[res$year %in% z$year, ], points(year, pobtot, pch=20, col=1))
abline(h=0, lty=3)
legend(2005.25, 50, c('measurem.', 'interpol.', 'extrapol.'), pch=20,
col=c(1, 3, 4), cex=.8, bty='n')
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.