
How to interpolate/extrapolate using na.approx function within individual groups in R

I have a panel dataset with 10 variables for 60 countries over 18 years (2000-2017), and a lot of the data is missing.

Country Year    Broadband

Albania 2000    NA
Albania 2001    NA
Albania 2002    NA
Albania 2003    NA
Albania 2004    NA
Albania 2005    272
Albania 2006    NA
Albania 2007    10000
Albania 2008    64000
Albania 2009    92000
Albania 2010    105539
Albania 2011    128210
Albania 2012    160088
Albania 2013    182556
Albania 2014    207931
Albania 2015    242870
Albania 2016    263874
Albania 2017    NA
Algeria 2000    NA
Algeria 2001    NA
Algeria 2002    NA
Algeria 2003    18000
Algeria 2004    36000

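I would like to interpolate (and extrapolate using rule = 2) with the na.approx function in R, but only within each country. For example, in this sample dataset I would like to interpolate the Albania value for 2006 and extrapolate the Albania values for 2000-2004 and 2017. However, I want to make sure that the Albania 2017 value is not interpolated between Albania 2016 and Algeria 2003. For Algeria 2000-2002, I would like the values to be extrapolated from the data for Algeria 2003 and 2004. I have tried the following code: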

data <- group_by(data, country)
data$broadband <- na.approx(data$broadband, maxgap = Inf, rule = 2)
data <- as.data.frame(data)

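I have also tried different values for maxgap, but none of them solves my problem. I assumed that using the group_by function would make this work, but it does not. Does anyone know of a solution?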

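Edit: The only way I can think of to do this is to split the dataset into a separate dataset for each unique country, using the following code: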

mylist <- split(data, data$country)

alb <- mylist[1]
alb <- as_data_frame(alb)
alg <- mylist[2]
alg <- as_data_frame(alg)
ang <- mylist[3]
ang <- as_data_frame(ang)

and then use the na.approx function on the separate datasets one at a time.
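
The manual split described above can be avoided in base R. The following is a minimal sketch, assuming zoo is installed; the toy data frame and the column name broadband_imp are made up for illustration, with values copied from the sample table. ave() applies a function to one vector separately within each level of a grouping variable and returns the results in the original row order, so no value is interpolated across a country boundary.

```r
library(zoo)

# Toy panel with two countries (values copied from the sample table)
df <- data.frame(
  country   = c(rep("Albania", 4), rep("Algeria", 4)),
  year      = c(2014:2017, 2001:2004),
  broadband = c(207931, 242870, 263874, NA, NA, NA, 18000, 36000)
)

# ave() runs FUN on broadband within each country separately.
# na.rm = FALSE keeps the output the same length as the input.
df$broadband_imp <- ave(
  df$broadband, df$country,
  FUN = function(v) na.approx(v, maxgap = Inf, rule = 2, na.rm = FALSE)
)

print(df)
```

Here the Albania 2017 gap is filled from Albania 2016 only, and the Algeria 2001-2002 gaps are filled from Algeria 2003 only.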

Edit 2:

I have tried the solution suggested by Markus below, but it does not seem to work. Here is the result of the suggested code for the Angola values:

Country Year    Broadband   Broadband_imp

Algeria 2014    1599692 1599692
Algeria 2015    2269348 2269348
Algeria 2016    2858906 2858906
Angola  2000    NA  2451556.286
Angola  2001    NA  2044206.571
Angola  2002    NA  1636856.857
Angola  2003    NA  1229507.143
Angola  2004    NA  822157.429
Angola  2005    NA  414807.714
Angola  2006    7458    7458
Angola  2007    11700   11700

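As you can see, the imputed values for Angola 2000-2005 appear to have been calculated using Algeria's values: they are far higher than they should be, given that the Angola 2006 value is 7458.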
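
The numbers above are what a straight line from Algeria 2016 (2858906) to Angola 2006 (7458) would give, which means the grouping was ignored. One possible cause, which is an assumption and not confirmed anywhere in this thread, is that plyr was attached after dplyr: plyr::mutate then masks dplyr::mutate and does not respect group_by. A minimal sketch that rules this out by calling the dplyr functions with an explicit namespace (the toy data frame is made up from the values in the table above):

```r
library(zoo)

# Toy data: the last Algeria rows followed by the first Angola rows
df <- data.frame(
  Country   = c("Algeria", "Algeria", "Angola", "Angola", "Angola", "Angola"),
  Year      = c(2015, 2016, 2004, 2005, 2006, 2007),
  Broadband = c(2269348, 2858906, NA, NA, 7458, 11700)
)

# Namespace-qualified calls cannot be masked by another attached package
grouped <- dplyr::group_by(df, Country)
df_imputed <- dplyr::mutate(
  grouped,
  Broadband_imp = na.approx(Broadband, maxgap = Inf, rule = 2, na.rm = FALSE)
)
df_imputed <- as.data.frame(dplyr::ungroup(df_imputed))

print(df_imputed)
```

With the grouping respected, Angola 2004 and 2005 are filled with 7458 and not with values on a line down from Algeria 2016. Typing environment(mutate) at the console shows which package's mutate is currently in use.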

Edit 3: This is the full code I am using:

data <- read_excel("~/Documents/data.xlsx")

> dput(head(data))
structure(list(continent = c("Europe", "Europe", "Europe", "Europe", 
"Europe", "Europe"), country = c("Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania"), Year = c(2000, 2001, 2002, 
2003, 2004, 2005), `Individuals Using Internet, %, WB` = c(0.114097347, 
0.325798377, 0.390081273, 0.971900415, 2.420387798, 6.043890864
), `Secure Internet Servers, WB` = c(NA, 1, NA, 1, 2, 1), 
`Mobile Cellular Subscriptions, WB` = c(29791, 
392650, 851000, 1100000, 1259590, 1530244), 
`Fixed Broadband Subscriptions, WB` = c(NA, 
NA, NA, NA, NA, 272), `Trade, % GDP, WB` = c(55.9204287230026, 
57.4303612453301, 63.9342407411882, 65.4406219482911, 66.3578254370479, 
70.2953012017195), `Air transport, freight (million ton-km)` = c(0.003, 
0.003, 0.144, 0.088, 0.099, 0.1), 
`Air Transport, registered carrier departures worldwide, WB` = c(3885, 
3974, 3762, 3800, 4104, 4309), 
`FDI, net, inflows, % GDP, WB` = c(3.93717707227928, 
5.10495722596557, 3.04391445388559, 3.09793068135411, 4.66563777108359, 
3.21722676118428), `Number of Airports, WFB` = c(10, 11, 11, 
11, 11, 11), `Currently under EU Arms Sanctions` = c(0, 0, 0, 
0, 0, 0), `Currently under EU Economic Sanctions` = c(0, 0, 0, 
0, 0, 0), `Currently under UN Arms Sanctions` = c(0, 0, 0, 0, 
0, 0), `Currently under UN Economic Sanctions` = c(0, 0, 0, 0, 
0, 0), `Currently under US Arms Embargo` = c(0, 0, 0, 0, 0, 0
), `Currently under US Economic Sanctions` = c(0, 0, 0, 0, 0, 
0)), .Names = c("continent", "country", "Year", 
"Individuals Using Internet, %, WB", 
"Secure Internet Servers, WB", "Mobile Cellular Subscriptions, WB", 
"Fixed Broadband Subscriptions, WB", "Trade, % GDP, WB", 
"Air transport, freight (million ton-km)", 
"Air Transport, registered carrier departures worldwide, WB", 
"FDI, net, inflows, % GDP, WB", "Number of Airports, WFB", 
"Currently under EU Arms Sanctions", 
"Currently under EU Economic Sanctions", "Currently under UN Arms Sanctions", 
"Currently under UN Economic Sanctions", "Currently under US Arms Embargo", 
"Currently under US Economic Sanctions"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

data_imputed <- data %>% 
  group_by(country) %>% 
  mutate(broadband_imp = na.approx(broadband, maxgap=Inf, rule = 2))

You can use group_by and mutate:

library(tidyverse)
library(zoo)

df_imputed <- df %>% 
  group_by(Country) %>% 
  mutate(Broadband_imputed = na.approx(Broadband, maxgap = Inf, rule = 2))

This gives

> head(df_imputed)
# A tibble: 6 x 4
# Groups:   Country [1]
  Country  Year Broadband Broadband_imputed
   <fctr> <int>     <int>             <dbl>
1 Albania  2000        NA               272
2 Albania  2001        NA               272
3 Albania  2002        NA               272
4 Albania  2003        NA               272
5 Albania  2004        NA               272
6 Albania  2005       272               272

> df_imputed %>% filter(Country == 'Algeria')
# A tibble: 5 x 4
# Groups:   Country [1]
  Country  Year Broadband Broadband_imputed
   <fctr> <int>     <int>             <dbl>
1 Algeria  2000        NA             18000
2 Algeria  2001        NA             18000
3 Algeria  2002        NA             18000
4 Algeria  2003     18000             18000
5 Algeria  2004     36000             36000
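
Note that the leading gaps in the output above are filled with a constant (272 for Albania, 18000 for Algeria). rule = 2 is passed through to approx, which repeats the value at the closest data extreme and does not continue the trend. If a trend-based extrapolation is wanted, a helper along the following lines could be used per group. This is only a sketch: fill_linear is a hypothetical function, not part of zoo, and it extends the slope of the two nearest observed points.

```r
library(zoo)

# Algeria 2000-2004 from the sample data
algeria <- c(NA, NA, NA, 18000, 36000)

# rule = 2 repeats the nearest observed value outside the data range
flat <- na.approx(algeria, rule = 2, na.rm = FALSE)

# Hypothetical helper (not part of zoo): linear interpolation inside the
# observed range, linear extrapolation from the two end points outside it.
fill_linear <- function(y, x = seq_along(y)) {
  ok <- which(!is.na(y))
  if (length(ok) < 2) return(y)            # no slope can be defined
  n <- length(y)
  out <- approx(x[ok], y[ok], xout = x)$y  # NA outside the observed range
  first <- ok[1]
  last  <- ok[length(ok)]
  prev  <- ok[length(ok) - 1]
  lead  <- seq_len(first - 1)
  trail <- if (last < n) (last + 1):n else integer(0)
  slope_lead  <- (y[ok[2]] - y[first]) / (x[ok[2]] - x[first])
  slope_trail <- (y[last] - y[prev]) / (x[last] - x[prev])
  out[lead]  <- y[first] + slope_lead  * (x[lead]  - x[first])
  out[trail] <- y[last]  + slope_trail * (x[trail] - x[last])
  out
}

trend <- fill_linear(algeria)

print(flat)
print(trend)
```

For Algeria the trend-based version produces negative subscription counts for 2000 and 2001, so for count data the constant fill from rule = 2, or a floor at zero, may be the safer choice. Inside the grouped pipeline the helper would be called as mutate(Broadband_imputed = fill_linear(Broadband, Year)).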

Data

df <- read.table(text = "Country Year    Broadband
Albania 2000    NA
Albania 2001    NA
Albania 2002    NA
Albania 2003    NA
Albania 2004    NA
Albania 2005    272
Albania 2006    NA
Albania 2007    10000
Albania 2008    64000
Albania 2009    92000
Albania 2010    105539
Albania 2011    128210
Albania 2012    160088
Albania 2013    182556
Albania 2014    207931
Albania 2015    242870
Albania 2016    263874
Albania 2017    NA
Algeria 2000    NA
Algeria 2001    NA
Algeria 2002    NA
Algeria 2003    18000
Algeria 2004    36000", header = TRUE)
