[英]How to fill NA by taking mean of two columns in R?
Below is the dput
of my dataset. 下面是
dput
我的数据集。 I am trying to fill my dataset such that if there is NA
present in particular column of the year, then the NA
should be filled with mean
of other two years. 我正在尝试填充我的数据集,以便如果该年的特定列中存在
NA
,则该NA
应以其他两年的mean
填充。 For example, in the dataset below, Congo contains NA
for the "Economy.2015" column, so that NA
should be filled with mean from the columns "Economy.2016" and "Economy.2017". 例如,在下面的数据集中,刚果在“ Economy.2015”列中包含
NA
,因此应使用“ Economy.2016”和“ Economy.2017”列中的均值填充NA
。
dput dput
structure(list(Country = c("Angola", "Bosnia and Herzegovina",
"Congo (Kinshasa)", "Greece", "Indonesia", "Iraq", "Sierra Leone",
"Sudan", "Togo"), Region = c("Sub-Saharan Africa", "Central and Eastern Europe",
"Sub-Saharan Africa", "Western Europe", "Southeastern Asia",
"Middle East and Northern Africa", "Sub-Saharan Africa", "Sub-Saharan Africa",
"Sub-Saharan Africa"), Happiness.Rank.2015 = c(137L, 96L, 120L,
102L, 74L, 112L, 123L, 118L, 158L), Happiness.Score.2015 = c(4.033,
4.949, 4.517, 4.857, 5.399, 4.677, 4.507, 4.55, 2.839), Standard.Error.2015 = c(0.04758,
0.06913, 0.0368, 0.05062, 0.02596, 0.05232, 0.07068, 0.0674,
0.06727), Economy.2015 = c(0.75778, 0.83223, NA, 1.15406, 0.82827,
0.98549, 0.33024, 0.52107, 0.20868), Family.2015 = c(0.8604,
0.91916, 1.0012, 0.92933, 1.08708, 0.81889, 0.95571, 1.01404,
0.13995), Health.2015 = c(0.16683, 0.79081, 0.09806, 0.88213,
0.63793, 0.60237, NA, 0.36878, 0.28443), Freedom.2015 = c(0.10384,
0.09245, 0.22605, 0.07699, 0.46611, NA, 0.4084, 0.10081, 0.36453
), Trust.2015 = c(0.07122, 0.00227, 0.07625, 0.01397, NA, 0.13788,
0.08786, 0.1466, 0.10731), Generosity.2015 = c(0.12344, 0.24808,
0.24834, NA, 0.51535, 0.17922, 0.21488, 0.19062, 0.16681), Dystopia.Residual.2015 = c(1.94939,
2.06367, 2.86712, 1.80101, 1.86399, 1.95335, 2.51009, 2.20857,
1.56726), Region.2016 = c("Sub-Saharan Africa", "Central and Eastern Europe",
"Sub-Saharan Africa", "Western Europe", "Southeastern Asia",
"Middle East and Northern Africa", "Sub-Saharan Africa", "Sub-Saharan Africa",
"Sub-Saharan Africa"), Happiness.Rank.2016 = c(141L, 87L, 125L,
99L, 79L, 112L, 111L, 133L, 155L), Happiness.Score.2016 = c(3.866,
5.163, 4.272, 5.033, 5.314, 4.575, 4.635, 4.139, 3.303), Lower.CI.2016 = c(3.753,
5.063, 4.191, 4.935, 5.237, 4.446, 4.505, 3.928, 3.192), Upper.CI.2016 = c(3.979,
5.263, 4.353, 5.131, 5.391, 4.704, 4.765, 4.35, 3.414), Economy.2016 = c(0.84731,
0.93383, 0.05661, 1.24886, 0.95104, 1.07474, 0.36485, 0.63069,
0.28123), Family.2016 = c(0.66366, 0.64367, 0.80676, 0.75473,
0.87625, 0.59205, 0.628, 0.81928, NA), Health.2016 = c(0.04991,
0.70766, 0.188, 0.80029, 0.49374, 0.51076, NA, 0.29759, 0.24811
), Freedom.2016 = c(0.00589, 0.09511, 0.15602, 0.05822, 0.39237,
0.24856, 0.30685, NA, 0.34678), Trust.2016 = c(0.08434, NA, 0.06075,
0.04127, 0.00322, 0.13636, 0.08196, 0.10039, 0.11587), Generosity.2016 = c(0.12071,
0.29889, 0.25458, NA, 0.56521, 0.19589, 0.23897, 0.18077, 0.17517
), Dystopia.Residual.2016 = c(2.09459, 2.48406, 2.74924, 2.12944,
2.03171, 1.81657, 3.01402, 2.10995, 2.1354), Happiness.Rank.2017 = c(140L,
90L, 126L, 87L, 81L, 117L, 106L, 130L, 150L), Happiness.Score.2017 = c(3.79500007629395,
5.18200016021729, 4.28000020980835, 5.22700023651123, 5.26200008392334,
4.49700021743774, 4.70900011062622, 4.13899993896484, 3.49499988555908
), Whisker.high.2017 = c(3.95164193540812, 5.27633568674326,
4.35781083270907, 5.3252461694181, 5.35288859814405, 4.62259140968323,
4.85064333498478, 4.34574716508389, 3.59403811171651), whisker.low.2017 = c(3.63835821717978,
5.08766463369131, 4.20218958690763, 5.12875430360436, 5.17111156970263,
4.37140902519226, 4.56735688626766, 3.9322527128458, 3.39596165940166
), Economy.2017 = c(0.858428180217743, 0.982409417629242, 0.0921023488044739,
1.28948748111725, 0.995538592338562, 1.10271048545837, 0.36842092871666,
0.65951669216156, 0.305444717407227), Family.2017 = c(1.10441195964813,
1.0693359375, 1.22902345657349, 1.23941457271576, 1.27444469928741,
0.978613197803497, 0.984136044979095, 1.21400856971741, 0.431882530450821
), Health.2017 = c(0.0498686656355858, 0.705186307430267, 0.191407024860382,
0.810198903083801, 0.492345720529556, 0.501180469989777, 0.00556475389748812,
0.290920823812485, 0.247105568647385), Freedom.2017 = c(NA, 0.204403176903725,
0.235961347818375, 0.0957312509417534, 0.443323463201523, 0.288555532693863,
0.318697690963745, 0.0149958552792668, 0.38042613863945), Generosity.2017 = c(0.097926490008831,
0.328867495059967, 0.246455833315849, NA, 0.611704587936401,
0.19963726401329, 0.293040901422501, 0.182317450642586, 0.196896150708199
), Trust.2017 = c(0.0697203353047371, NA, 0.0602413564920425,
0.04328977689147, 0.0153171354904771, 0.107215754687786, 0.0710951760411263,
0.089847519993782, 0.0956650152802467), Dystopia.Residual.2017 = c(1.61448240280151,
1.89217257499695, 2.22495865821838, 1.74922156333923, 1.42947697639465,
1.31890726089478, 2.66845989227295, 1.68706583976746, 1.83722925186157
)), class = "data.frame", row.names = c(NA, -9L))
Structure of dataframe 数据框的结构
Country Region Happiness.Rank.2015 Happiness.Score.2015
1 Angola Sub-Saharan Africa 137 4.033
2 Bosnia and Herzegovina Central and Eastern Europe 96 4.949
3 Congo (Kinshasa) Sub-Saharan Africa 120 4.517
4 Greece Western Europe 102 4.857
5 Indonesia Southeastern Asia 74 5.399
6 Iraq Middle East and Northern Africa 112 4.677
7 Sierra Leone Sub-Saharan Africa 123 4.507
8 Sudan Sub-Saharan Africa 118 4.550
9 Togo Sub-Saharan Africa 158 2.839
Standard.Error.2015 Economy.2015 Family.2015 Health.2015 Freedom.2015 Trust.2015 Generosity.2015
1 0.04758 0.75778 0.86040 0.16683 0.10384 0.07122 0.12344
2 0.06913 0.83223 0.91916 0.79081 0.09245 0.00227 0.24808
3 0.03680 NA 1.00120 0.09806 0.22605 0.07625 0.24834
4 0.05062 1.15406 0.92933 0.88213 0.07699 0.01397 NA
5 0.02596 0.82827 1.08708 0.63793 0.46611 NA 0.51535
6 0.05232 0.98549 0.81889 0.60237 NA 0.13788 0.17922
7 0.07068 0.33024 0.95571 NA 0.40840 0.08786 0.21488
8 0.06740 0.52107 1.01404 0.36878 0.10081 0.14660 0.19062
9 0.06727 0.20868 0.13995 0.28443 0.36453 0.10731 0.16681
Dystopia.Residual.2015 Region.2016 Happiness.Rank.2016 Happiness.Score.2016
1 1.94939 Sub-Saharan Africa 141 3.866
2 2.06367 Central and Eastern Europe 87 5.163
3 2.86712 Sub-Saharan Africa 125 4.272
4 1.80101 Western Europe 99 5.033
5 1.86399 Southeastern Asia 79 5.314
6 1.95335 Middle East and Northern Africa 112 4.575
7 2.51009 Sub-Saharan Africa 111 4.635
8 2.20857 Sub-Saharan Africa 133 4.139
9 1.56726 Sub-Saharan Africa 155 3.303
Lower.CI.2016 Upper.CI.2016 Economy.2016 Family.2016 Health.2016 Freedom.2016 Trust.2016
1 3.753 3.979 0.84731 0.66366 0.04991 0.00589 0.08434
2 5.063 5.263 0.93383 0.64367 0.70766 0.09511 NA
3 4.191 4.353 0.05661 0.80676 0.18800 0.15602 0.06075
4 4.935 5.131 1.24886 0.75473 0.80029 0.05822 0.04127
5 5.237 5.391 0.95104 0.87625 0.49374 0.39237 0.00322
6 4.446 4.704 1.07474 0.59205 0.51076 0.24856 0.13636
7 4.505 4.765 0.36485 0.62800 NA 0.30685 0.08196
8 3.928 4.350 0.63069 0.81928 0.29759 NA 0.10039
9 3.192 3.414 0.28123 NA 0.24811 0.34678 0.11587
Generosity.2016 Dystopia.Residual.2016 Happiness.Rank.2017 Happiness.Score.2017 Whisker.high.2017
1 0.12071 2.09459 140 3.795 3.951642
2 0.29889 2.48406 90 5.182 5.276336
3 0.25458 2.74924 126 4.280 4.357811
4 NA 2.12944 87 5.227 5.325246
5 0.56521 2.03171 81 5.262 5.352889
6 0.19589 1.81657 117 4.497 4.622591
7 0.23897 3.01402 106 4.709 4.850643
8 0.18077 2.10995 130 4.139 4.345747
9 0.17517 2.13540 150 3.495 3.594038
whisker.low.2017 Economy.2017 Family.2017 Health.2017 Freedom.2017 Generosity.2017 Trust.2017
1 3.638358 0.85842818 1.1044120 0.049868666 NA 0.09792649 0.06972034
2 5.087665 0.98240942 1.0693359 0.705186307 0.20440318 0.32886750 NA
3 4.202190 0.09210235 1.2290235 0.191407025 0.23596135 0.24645583 0.06024136
4 5.128754 1.28948748 1.2394146 0.810198903 0.09573125 NA 0.04328978
5 5.171112 0.99553859 1.2744447 0.492345721 0.44332346 0.61170459 0.01531714
6 4.371409 1.10271049 0.9786132 0.501180470 0.28855553 0.19963726 0.10721575
7 4.567357 0.36842093 0.9841360 0.005564754 0.31869769 0.29304090 0.07109518
8 3.932253 0.65951669 1.2140086 0.290920824 0.01499586 0.18231745 0.08984752
9 3.395962 0.30544472 0.4318825 0.247105569 0.38042614 0.19689615 0.09566502
Dystopia.Residual.2017
1 1.614482
2 1.892173
3 2.224959
4 1.749222
5 1.429477
6 1.318907
7 2.668460
8 1.687066
9 1.837229
Update#1: What I have tried 更新#1:我尝试过的
I have tried apply
function using the code suggested by @RAB. 我已经尝试使用@RAB建议的代码来
apply
函数。 It gave me warning message as below 它给了我如下警告信息
Code used 使用的代码
dt <- apply(df, 1, mean, na.rm=T)
Warning Message 警告信息
1: In mean.default(newX[, i], ...) : argument is not numeric or logical: returning NA
1:在mean.default(newX [,i],...)中:参数不是数字或逻辑:返回NA
str of dataframe 数据框的str
'data.frame': 9 obs. of 35 variables:
$ Country : chr "Angola" "Bosnia and Herzegovina" "Congo (Kinshasa)" "Greece" ...
$ Region : chr "Sub-Saharan Africa" "Central and Eastern Europe" "Sub-Saharan Africa" "Western Europe" ...
$ Happiness.Rank.2015 : int 137 96 120 102 74 112 123 118 158
$ Happiness.Score.2015 : num 4.03 4.95 4.52 4.86 5.4 ...
$ Standard.Error.2015 : num 0.0476 0.0691 0.0368 0.0506 0.026 ...
$ Economy.2015 : num 0.758 0.832 NA 1.154 0.828 ...
$ Family.2015 : num 0.86 0.919 1.001 0.929 1.087 ...
$ Health.2015 : num 0.1668 0.7908 0.0981 0.8821 0.6379 ...
$ Freedom.2015 : num 0.1038 0.0925 0.2261 0.077 0.4661 ...
$ Trust.2015 : num 0.07122 0.00227 0.07625 0.01397 NA ...
$ Generosity.2015 : num 0.123 0.248 0.248 NA 0.515 ...
$ Dystopia.Residual.2015: num 1.95 2.06 2.87 1.8 1.86 ...
$ Region.2016 : chr "Sub-Saharan Africa" "Central and Eastern Europe" "Sub-Saharan Africa" "Western Europe" ...
$ Happiness.Rank.2016 : int 141 87 125 99 79 112 111 133 155
$ Happiness.Score.2016 : num 3.87 5.16 4.27 5.03 5.31 ...
$ Lower.CI.2016 : num 3.75 5.06 4.19 4.93 5.24 ...
$ Upper.CI.2016 : num 3.98 5.26 4.35 5.13 5.39 ...
$ Economy.2016 : num 0.8473 0.9338 0.0566 1.2489 0.951 ...
$ Family.2016 : num 0.664 0.644 0.807 0.755 0.876 ...
$ Health.2016 : num 0.0499 0.7077 0.188 0.8003 0.4937 ...
$ Freedom.2016 : num 0.00589 0.09511 0.15602 0.05822 0.39237 ...
$ Trust.2016 : num 0.08434 NA 0.06075 0.04127 0.00322 ...
$ Generosity.2016 : num 0.121 0.299 0.255 NA 0.565 ...
$ Dystopia.Residual.2016: num 2.09 2.48 2.75 2.13 2.03 ...
$ Happiness.Rank.2017 : int 140 90 126 87 81 117 106 130 150
$ Happiness.Score.2017 : num 3.8 5.18 4.28 5.23 5.26 ...
$ Whisker.high.2017 : num 3.95 5.28 4.36 5.33 5.35 ...
$ whisker.low.2017 : num 3.64 5.09 4.2 5.13 5.17 ...
$ Economy.2017 : num 0.8584 0.9824 0.0921 1.2895 0.9955 ...
$ Family.2017 : num 1.1 1.07 1.23 1.24 1.27 ...
$ Health.2017 : num 0.0499 0.7052 0.1914 0.8102 0.4923 ...
$ Freedom.2017 : num NA 0.2044 0.236 0.0957 0.4433 ...
$ Generosity.2017 : num 0.0979 0.3289 0.2465 NA 0.6117 ...
$ Trust.2017 : num 0.0697 NA 0.0602 0.0433 0.0153 ...
$ Dystopia.Residual.2017: num 1.61 1.89 2.22 1.75 1.43 ...
Note: I am new to R, please provide an explanation along with the code. 注意:我是R的新手,请提供解释以及代码。
Your data needs to be numeric for this to work, so step 1 will be to filter out only the numeric data (we will put the other stuff back in later) 您的数据必须为数字才能正常工作,因此第一步将仅过滤出数字数据(我们稍后将其他数据放回去)
You will need to replace "yourdata" with your dataframe name 您将需要用数据框名称替换“ yourdata”
Step 1: filter for only numeric 步骤1:仅筛选数字
df <- Filter(is.numeric, yourdata)
Step 2: get the means 步骤2:掌握方法
mns <- apply(df, 1, mean, na.rm=T) # this gets the mean of each row
Step 3: find the indexes of the NA values 步骤3:找到NA值的索引
nas <- as.data.frame(which(is.na(df), arr.ind = T))
# the data frame makes it easier to extract the row info for later
Step 4: substitute the NA values with the corresponding mean 步骤4:将NA值替换为相应的平均值
df[which(is.na(df), arr.ind = T)] <- mns[nas$row]
Step 5: combine the non-numeric columns with the new columns 步骤5:将非数字列与新列组合
new_df <- cbind(Filter(Negate(is.numeric), yourdata), df)
Edit: 编辑:
I was bored, so hear's a function for you: 我很无聊,所以听说有一个适合您的功能:
replace_missing <- function(df, groups){
cols <- names(df)
df_char <- Filter(Negate(is.numeric), df)
df_num <- Filter(is.numeric, df)
for(gg in 1:length(groups)){
tmp <- df_num[, grep(groups[gg], names(df_num))]
mns <- apply(tmp, 1, mean, na.rm=T)
nas <- as.data.frame(which(is.na(tmp), arr.ind = T))
if (nrow(nas) > 0){
tmp[which(is.na(tmp), arr.ind = T)] <- mns[nas$row]
}
df_char <- cbind(df_char, tmp)
}
new_df <- cbind(df_char, df[, setdiff(names(df), names(df_char))])
new_df <- new_df[, cols]
}
new_data <- replace_missing(yourdata, groups = c("Happiness.Rank", "Happiness.Score",
"Family", "Economy"))
You can add as many as you want to the groups
field 您可以在“
groups
字段中添加任意多个
Here is a fairly straight-foward tidyverse
solution; 这是一个相当简单的
tidyverse
解决方案。 the key here is to reshape data from wide to long, then "suitably" replace NA
values before transforming data back to wide. 这里的关键是将数据的形状从宽变长,然后“适当地”替换
NA
值,然后再将数据转换回宽。 I give (some) explanations at the end but I encourage you to execute the code line-by-line to understand what every step does. 最后,我给出了一些解释,但我鼓励您逐行执行代码以了解每个步骤的作用。
library(tidyverse)
df.new <- df %>%
gather(key, val, -Country, -Region, -Region.2016) %>%
separate(key, c("what", "when"), sep = "\\.(?=\\d)", remove = FALSE) %>%
group_by(Country, what) %>%
mutate(val = replace(val, is.na(val), mean(val, na.rm = TRUE))) %>%
ungroup() %>%
select(-what, -when) %>%
spread(key, val)
df.new
## A tibble: 9 x 35
# Country Region Region.2016 Dystopia.Residu… Dystopia.Residu… Dystopia.Residu…
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 Angola Sub-S… Sub-Sahara… 1.95 2.09 1.61
#2 Bosnia… Centr… Central an… 2.06 2.48 1.89
#3 Congo … Sub-S… Sub-Sahara… 2.87 2.75 2.22
#4 Greece Weste… Western Eu… 1.80 2.13 1.75
#5 Indone… South… Southeaste… 1.86 2.03 1.43
#6 Iraq Middl… Middle Eas… 1.95 1.82 1.32
#7 Sierra… Sub-S… Sub-Sahara… 2.51 3.01 2.67
#8 Sudan Sub-S… Sub-Sahara… 2.21 2.11 1.69
#9 Togo Sub-S… Sub-Sahara… 1.57 2.14 1.84
## ... with 29 more variables: Economy.2015 <dbl>, Economy.2016 <dbl>,
## Economy.2017 <dbl>, Family.2015 <dbl>, Family.2016 <dbl>,
## Family.2017 <dbl>, Freedom.2015 <dbl>, Freedom.2016 <dbl>,
## Freedom.2017 <dbl>, Generosity.2015 <dbl>, Generosity.2016 <dbl>,
## Generosity.2017 <dbl>, Happiness.Rank.2015 <dbl>,
## Happiness.Rank.2016 <dbl>, Happiness.Rank.2017 <dbl>,
## Happiness.Score.2015 <dbl>, Happiness.Score.2016 <dbl>,
## Happiness.Score.2017 <dbl>, Health.2015 <dbl>, Health.2016 <dbl>,
## Health.2017 <dbl>, Lower.CI.2016 <dbl>, Standard.Error.2015 <dbl>,
## Trust.2015 <dbl>, Trust.2016 <dbl>, Trust.2017 <dbl>, Upper.CI.2016 <dbl>,
## Whisker.high.2017 <dbl>, whisker.low.2017 <dbl>
Explanation: 说明:
Country
, Region
and Region.2016
as they are. Country
, Region
和Region.2016
列Region.2016
。 All other column names are given in new column key
with values in val
. key
中给出,值在val
。 key
entries such as "Happiness.Score.2016"
into "Happiness.Score" (column
what ) and
"2016" (column
when`). key
条目(例如"Happiness.Score.2016"
分隔为"Happiness.Score" (column
列为what ) and
“ 2016” (column
列为when)。 Country
and what
. Country
和what
分组条目。 NA
s per Country
and what
by the mean value across all years. NA
每一台S Country
和what
的所有年的平均值。 ungroup
and remove the what
and when
columns before ungroup
并删除之前的what
和when
列 Mind you, it might actually be a lot easier (and more in line with "tidy" data) to keep your data in long format; 提醒您,实际上,将数据保留为长格式可能会容易得多(并且更符合“整洁”的数据)。 but that's just my opinion.
但那只是我的个人意见。
Let's check for Country == "Congo"
让我们检查
Country == "Congo"
df.new %>% filter(str_detect(Country, "Congo")) %>% select(contains("Economy"))
## A tibble: 1 x 3
# Economy.2015 Economy.2016 Economy.2017
# <dbl> <dbl> <dbl>
#1 0.0744 0.0566 0.0921
and compare with the original data 并与原始数据进行比较
df %>% filter(str_detect(Country, "Congo")) %>% select(contains("Economy"))
# Economy.2015 Economy.2016 Economy.2017
#1 NA 0.05661 0.09210235
So here 0.0744 = 1/2 * (0.05661 + 0.09210235)
. 所以这里
0.0744 = 1/2 * (0.05661 + 0.09210235)
。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.