简体   繁体   English

如何通过取R中的两列均值来填充NA?

[英]How to fill NA by taking mean of two columns in R?

Below is the dput of my dataset. 下面是dput我的数据集。 I am trying to fill my dataset such that if there is NA present in particular column of the year, then the NA should be filled with mean of other two years. 我正在尝试填充我的数据集,以便如果该年的特定列中存在NA ,则该NA应以其他两年的mean填充。 For example, in the dataset below, Congo contains NA for the "Economy.2015" column, so that NA should be filled with mean from the columns "Economy.2016" and "Economy.2017". 例如,在下面的数据集中,刚果在“ Economy.2015”列中包含NA ,因此应使用“ Economy.2016”和“ Economy.2017”列中的均值填充NA

dput dput

structure(list(Country = c("Angola", "Bosnia and Herzegovina", 
"Congo (Kinshasa)", "Greece", "Indonesia", "Iraq", "Sierra Leone", 
"Sudan", "Togo"), Region = c("Sub-Saharan Africa", "Central and Eastern Europe", 
"Sub-Saharan Africa", "Western Europe", "Southeastern Asia", 
"Middle East and Northern Africa", "Sub-Saharan Africa", "Sub-Saharan Africa", 
"Sub-Saharan Africa"), Happiness.Rank.2015 = c(137L, 96L, 120L, 
102L, 74L, 112L, 123L, 118L, 158L), Happiness.Score.2015 = c(4.033, 
4.949, 4.517, 4.857, 5.399, 4.677, 4.507, 4.55, 2.839), Standard.Error.2015 = c(0.04758, 
0.06913, 0.0368, 0.05062, 0.02596, 0.05232, 0.07068, 0.0674, 
0.06727), Economy.2015 = c(0.75778, 0.83223, NA, 1.15406, 0.82827, 
0.98549, 0.33024, 0.52107, 0.20868), Family.2015 = c(0.8604, 
0.91916, 1.0012, 0.92933, 1.08708, 0.81889, 0.95571, 1.01404, 
0.13995), Health.2015 = c(0.16683, 0.79081, 0.09806, 0.88213, 
0.63793, 0.60237, NA, 0.36878, 0.28443), Freedom.2015 = c(0.10384, 
0.09245, 0.22605, 0.07699, 0.46611, NA, 0.4084, 0.10081, 0.36453
), Trust.2015 = c(0.07122, 0.00227, 0.07625, 0.01397, NA, 0.13788, 
0.08786, 0.1466, 0.10731), Generosity.2015 = c(0.12344, 0.24808, 
0.24834, NA, 0.51535, 0.17922, 0.21488, 0.19062, 0.16681), Dystopia.Residual.2015 = c(1.94939, 
2.06367, 2.86712, 1.80101, 1.86399, 1.95335, 2.51009, 2.20857, 
1.56726), Region.2016 = c("Sub-Saharan Africa", "Central and Eastern Europe", 
"Sub-Saharan Africa", "Western Europe", "Southeastern Asia", 
"Middle East and Northern Africa", "Sub-Saharan Africa", "Sub-Saharan Africa", 
"Sub-Saharan Africa"), Happiness.Rank.2016 = c(141L, 87L, 125L, 
99L, 79L, 112L, 111L, 133L, 155L), Happiness.Score.2016 = c(3.866, 
5.163, 4.272, 5.033, 5.314, 4.575, 4.635, 4.139, 3.303), Lower.CI.2016 = c(3.753, 
5.063, 4.191, 4.935, 5.237, 4.446, 4.505, 3.928, 3.192), Upper.CI.2016 = c(3.979, 
5.263, 4.353, 5.131, 5.391, 4.704, 4.765, 4.35, 3.414), Economy.2016 = c(0.84731, 
0.93383, 0.05661, 1.24886, 0.95104, 1.07474, 0.36485, 0.63069, 
0.28123), Family.2016 = c(0.66366, 0.64367, 0.80676, 0.75473, 
0.87625, 0.59205, 0.628, 0.81928, NA), Health.2016 = c(0.04991, 
0.70766, 0.188, 0.80029, 0.49374, 0.51076, NA, 0.29759, 0.24811
), Freedom.2016 = c(0.00589, 0.09511, 0.15602, 0.05822, 0.39237, 
0.24856, 0.30685, NA, 0.34678), Trust.2016 = c(0.08434, NA, 0.06075, 
0.04127, 0.00322, 0.13636, 0.08196, 0.10039, 0.11587), Generosity.2016 = c(0.12071, 
0.29889, 0.25458, NA, 0.56521, 0.19589, 0.23897, 0.18077, 0.17517
), Dystopia.Residual.2016 = c(2.09459, 2.48406, 2.74924, 2.12944, 
2.03171, 1.81657, 3.01402, 2.10995, 2.1354), Happiness.Rank.2017 = c(140L, 
90L, 126L, 87L, 81L, 117L, 106L, 130L, 150L), Happiness.Score.2017 = c(3.79500007629395, 
5.18200016021729, 4.28000020980835, 5.22700023651123, 5.26200008392334, 
4.49700021743774, 4.70900011062622, 4.13899993896484, 3.49499988555908
), Whisker.high.2017 = c(3.95164193540812, 5.27633568674326, 
4.35781083270907, 5.3252461694181, 5.35288859814405, 4.62259140968323, 
4.85064333498478, 4.34574716508389, 3.59403811171651), whisker.low.2017 = c(3.63835821717978, 
5.08766463369131, 4.20218958690763, 5.12875430360436, 5.17111156970263, 
4.37140902519226, 4.56735688626766, 3.9322527128458, 3.39596165940166
), Economy.2017 = c(0.858428180217743, 0.982409417629242, 0.0921023488044739, 
1.28948748111725, 0.995538592338562, 1.10271048545837, 0.36842092871666, 
0.65951669216156, 0.305444717407227), Family.2017 = c(1.10441195964813, 
1.0693359375, 1.22902345657349, 1.23941457271576, 1.27444469928741, 
0.978613197803497, 0.984136044979095, 1.21400856971741, 0.431882530450821
), Health.2017 = c(0.0498686656355858, 0.705186307430267, 0.191407024860382, 
0.810198903083801, 0.492345720529556, 0.501180469989777, 0.00556475389748812, 
0.290920823812485, 0.247105568647385), Freedom.2017 = c(NA, 0.204403176903725, 
0.235961347818375, 0.0957312509417534, 0.443323463201523, 0.288555532693863, 
0.318697690963745, 0.0149958552792668, 0.38042613863945), Generosity.2017 = c(0.097926490008831, 
0.328867495059967, 0.246455833315849, NA, 0.611704587936401, 
0.19963726401329, 0.293040901422501, 0.182317450642586, 0.196896150708199
), Trust.2017 = c(0.0697203353047371, NA, 0.0602413564920425, 
0.04328977689147, 0.0153171354904771, 0.107215754687786, 0.0710951760411263, 
0.089847519993782, 0.0956650152802467), Dystopia.Residual.2017 = c(1.61448240280151, 
1.89217257499695, 2.22495865821838, 1.74922156333923, 1.42947697639465, 
1.31890726089478, 2.66845989227295, 1.68706583976746, 1.83722925186157
)), class = "data.frame", row.names = c(NA, -9L))

Structure of dataframe 数据框的结构

   Country                          Region Happiness.Rank.2015 Happiness.Score.2015
1                 Angola              Sub-Saharan Africa                 137                4.033
2 Bosnia and Herzegovina      Central and Eastern Europe                  96                4.949
3       Congo (Kinshasa)              Sub-Saharan Africa                 120                4.517
4                 Greece                  Western Europe                 102                4.857
5              Indonesia               Southeastern Asia                  74                5.399
6                   Iraq Middle East and Northern Africa                 112                4.677
7           Sierra Leone              Sub-Saharan Africa                 123                4.507
8                  Sudan              Sub-Saharan Africa                 118                4.550
9                   Togo              Sub-Saharan Africa                 158                2.839
  Standard.Error.2015 Economy.2015 Family.2015 Health.2015 Freedom.2015 Trust.2015 Generosity.2015
1             0.04758      0.75778     0.86040     0.16683      0.10384    0.07122         0.12344
2             0.06913      0.83223     0.91916     0.79081      0.09245    0.00227         0.24808
3             0.03680           NA     1.00120     0.09806      0.22605    0.07625         0.24834
4             0.05062      1.15406     0.92933     0.88213      0.07699    0.01397              NA
5             0.02596      0.82827     1.08708     0.63793      0.46611         NA         0.51535
6             0.05232      0.98549     0.81889     0.60237           NA    0.13788         0.17922
7             0.07068      0.33024     0.95571          NA      0.40840    0.08786         0.21488
8             0.06740      0.52107     1.01404     0.36878      0.10081    0.14660         0.19062
9             0.06727      0.20868     0.13995     0.28443      0.36453    0.10731         0.16681
  Dystopia.Residual.2015                     Region.2016 Happiness.Rank.2016 Happiness.Score.2016
1                1.94939              Sub-Saharan Africa                 141                3.866
2                2.06367      Central and Eastern Europe                  87                5.163
3                2.86712              Sub-Saharan Africa                 125                4.272
4                1.80101                  Western Europe                  99                5.033
5                1.86399               Southeastern Asia                  79                5.314
6                1.95335 Middle East and Northern Africa                 112                4.575
7                2.51009              Sub-Saharan Africa                 111                4.635
8                2.20857              Sub-Saharan Africa                 133                4.139
9                1.56726              Sub-Saharan Africa                 155                3.303
  Lower.CI.2016 Upper.CI.2016 Economy.2016 Family.2016 Health.2016 Freedom.2016 Trust.2016
1         3.753         3.979      0.84731     0.66366     0.04991      0.00589    0.08434
2         5.063         5.263      0.93383     0.64367     0.70766      0.09511         NA
3         4.191         4.353      0.05661     0.80676     0.18800      0.15602    0.06075
4         4.935         5.131      1.24886     0.75473     0.80029      0.05822    0.04127
5         5.237         5.391      0.95104     0.87625     0.49374      0.39237    0.00322
6         4.446         4.704      1.07474     0.59205     0.51076      0.24856    0.13636
7         4.505         4.765      0.36485     0.62800          NA      0.30685    0.08196
8         3.928         4.350      0.63069     0.81928     0.29759           NA    0.10039
9         3.192         3.414      0.28123          NA     0.24811      0.34678    0.11587
  Generosity.2016 Dystopia.Residual.2016 Happiness.Rank.2017 Happiness.Score.2017 Whisker.high.2017
1         0.12071                2.09459                 140                3.795          3.951642
2         0.29889                2.48406                  90                5.182          5.276336
3         0.25458                2.74924                 126                4.280          4.357811
4              NA                2.12944                  87                5.227          5.325246
5         0.56521                2.03171                  81                5.262          5.352889
6         0.19589                1.81657                 117                4.497          4.622591
7         0.23897                3.01402                 106                4.709          4.850643
8         0.18077                2.10995                 130                4.139          4.345747
9         0.17517                2.13540                 150                3.495          3.594038
  whisker.low.2017 Economy.2017 Family.2017 Health.2017 Freedom.2017 Generosity.2017 Trust.2017
1         3.638358   0.85842818   1.1044120 0.049868666           NA      0.09792649 0.06972034
2         5.087665   0.98240942   1.0693359 0.705186307   0.20440318      0.32886750         NA
3         4.202190   0.09210235   1.2290235 0.191407025   0.23596135      0.24645583 0.06024136
4         5.128754   1.28948748   1.2394146 0.810198903   0.09573125              NA 0.04328978
5         5.171112   0.99553859   1.2744447 0.492345721   0.44332346      0.61170459 0.01531714
6         4.371409   1.10271049   0.9786132 0.501180470   0.28855553      0.19963726 0.10721575
7         4.567357   0.36842093   0.9841360 0.005564754   0.31869769      0.29304090 0.07109518
8         3.932253   0.65951669   1.2140086 0.290920824   0.01499586      0.18231745 0.08984752
9         3.395962   0.30544472   0.4318825 0.247105569   0.38042614      0.19689615 0.09566502
  Dystopia.Residual.2017
1               1.614482
2               1.892173
3               2.224959
4               1.749222
5               1.429477
6               1.318907
7               2.668460
8               1.687066
9               1.837229

Update#1: What I have tried 更新#1:我尝试过的

I have tried apply function using the code suggested by @RAB. 我已经尝试使用@RAB建议的代码来apply函数。 It gave me warning message as below 它给了我如下警告信息

Code used 使用的代码

dt <- apply(df, 1, mean, na.rm=T)

Warning Message 警告信息

1: In mean.default(newX[, i], ...) : argument is not numeric or logical: returning NA 1:在mean.default(newX [,i],...)中:参数不是数字或逻辑:返回NA

str of dataframe 数据框的str

'data.frame':   9 obs. of  35 variables:
 $ Country               : chr  "Angola" "Bosnia and Herzegovina" "Congo (Kinshasa)" "Greece" ...
 $ Region                : chr  "Sub-Saharan Africa" "Central and Eastern Europe" "Sub-Saharan Africa" "Western Europe" ...
 $ Happiness.Rank.2015   : int  137 96 120 102 74 112 123 118 158
 $ Happiness.Score.2015  : num  4.03 4.95 4.52 4.86 5.4 ...
 $ Standard.Error.2015   : num  0.0476 0.0691 0.0368 0.0506 0.026 ...
 $ Economy.2015          : num  0.758 0.832 NA 1.154 0.828 ...
 $ Family.2015           : num  0.86 0.919 1.001 0.929 1.087 ...
 $ Health.2015           : num  0.1668 0.7908 0.0981 0.8821 0.6379 ...
 $ Freedom.2015          : num  0.1038 0.0925 0.2261 0.077 0.4661 ...
 $ Trust.2015            : num  0.07122 0.00227 0.07625 0.01397 NA ...
 $ Generosity.2015       : num  0.123 0.248 0.248 NA 0.515 ...
 $ Dystopia.Residual.2015: num  1.95 2.06 2.87 1.8 1.86 ...
 $ Region.2016           : chr  "Sub-Saharan Africa" "Central and Eastern Europe" "Sub-Saharan Africa" "Western Europe" ...
 $ Happiness.Rank.2016   : int  141 87 125 99 79 112 111 133 155
 $ Happiness.Score.2016  : num  3.87 5.16 4.27 5.03 5.31 ...
 $ Lower.CI.2016         : num  3.75 5.06 4.19 4.93 5.24 ...
 $ Upper.CI.2016         : num  3.98 5.26 4.35 5.13 5.39 ...
 $ Economy.2016          : num  0.8473 0.9338 0.0566 1.2489 0.951 ...
 $ Family.2016           : num  0.664 0.644 0.807 0.755 0.876 ...
 $ Health.2016           : num  0.0499 0.7077 0.188 0.8003 0.4937 ...
 $ Freedom.2016          : num  0.00589 0.09511 0.15602 0.05822 0.39237 ...
 $ Trust.2016            : num  0.08434 NA 0.06075 0.04127 0.00322 ...
 $ Generosity.2016       : num  0.121 0.299 0.255 NA 0.565 ...
 $ Dystopia.Residual.2016: num  2.09 2.48 2.75 2.13 2.03 ...
 $ Happiness.Rank.2017   : int  140 90 126 87 81 117 106 130 150
 $ Happiness.Score.2017  : num  3.8 5.18 4.28 5.23 5.26 ...
 $ Whisker.high.2017     : num  3.95 5.28 4.36 5.33 5.35 ...
 $ whisker.low.2017      : num  3.64 5.09 4.2 5.13 5.17 ...
 $ Economy.2017          : num  0.8584 0.9824 0.0921 1.2895 0.9955 ...
 $ Family.2017           : num  1.1 1.07 1.23 1.24 1.27 ...
 $ Health.2017           : num  0.0499 0.7052 0.1914 0.8102 0.4923 ...
 $ Freedom.2017          : num  NA 0.2044 0.236 0.0957 0.4433 ...
 $ Generosity.2017       : num  0.0979 0.3289 0.2465 NA 0.6117 ...
 $ Trust.2017            : num  0.0697 NA 0.0602 0.0433 0.0153 ...
 $ Dystopia.Residual.2017: num  1.61 1.89 2.22 1.75 1.43 ...

Note: I am new to R, please provide an explanation along with the code. 注意:我是R的新手,请提供解释以及代码。

Your data needs to be numeric for this to work, so step 1 will be to filter out only the numeric data (we will put the other stuff back in later) 您的数据必须为数字才能正常工作,因此第一步将仅过滤出数字数据(我们稍后将其他数据放回去)

You will need to replace "yourdata" with your dataframe name 您将需要用数据框名称替换“ yourdata”

Step 1: filter for only numeric 步骤1:仅筛选数字

df <- Filter(is.numeric, yourdata)

Step 2: get the means 步骤2:掌握方法

mns <- apply(df, 1, mean, na.rm=T) # this gets the mean of each row

Step 3: find the indexes of the NA values 步骤3:找到NA值的索引

nas <- as.data.frame(which(is.na(df), arr.ind = T)) 
# the data frame makes it easier to extract the row info for later

Step 4: substitute the NA values with the corresponding mean 步骤4:将NA值替换为相应的平均值

df[which(is.na(df), arr.ind = T)] <- mns[nas$row]

Step 5: combine the non-numeric columns with the new columns 步骤5:将非数字列与新列组合

new_df <- cbind(Filter(Negate(is.numeric), yourdata), df)

Edit: 编辑:

I was bored, so hear's a function for you: 我很无聊,所以听说有一个适合您的功能:

replace_missing <- function(df, groups){

  cols <- names(df)

  df_char <- Filter(Negate(is.numeric), df)
  df_num  <- Filter(is.numeric, df)

  for(gg in 1:length(groups)){
    tmp <- df_num[, grep(groups[gg], names(df_num))]
    mns <- apply(tmp, 1, mean, na.rm=T)
    nas <- as.data.frame(which(is.na(tmp), arr.ind = T))
    if (nrow(nas) > 0){
      tmp[which(is.na(tmp), arr.ind = T)] <- mns[nas$row]
    }
    df_char <- cbind(df_char, tmp)
  }

  new_df <- cbind(df_char, df[, setdiff(names(df), names(df_char))])
  new_df <- new_df[, cols]
}

new_data <- replace_missing(yourdata, groups = c("Happiness.Rank", "Happiness.Score", 
                            "Family", "Economy"))

You can add as many as you want to the groups field 您可以在“ groups字段中添加任意多个

Here is a fairly straight-foward tidyverse solution; 这是一个相当简单的tidyverse解决方案。 the key here is to reshape data from wide to long, then "suitably" replace NA values before transforming data back to wide. 这里的关键是将数据的形状从宽变长,然后“适当地”替换NA值,然后再将数据转换回宽。 I give (some) explanations at the end but I encourage you to execute the code line-by-line to understand what every step does. 最后,我给出了一些解释,但我鼓励您逐行执行代码以了解每个步骤的作用。

library(tidyverse)
df.new <- df %>%
    gather(key, val, -Country, -Region, -Region.2016) %>%
    separate(key, c("what", "when"), sep = "\\.(?=\\d)", remove = FALSE) %>%
    group_by(Country, what) %>%
    mutate(val = replace(val, is.na(val), mean(val, na.rm = TRUE))) %>%
    ungroup() %>%
    select(-what, -when) %>%
    spread(key, val)
df.new
## A tibble: 9 x 35
#  Country Region Region.2016 Dystopia.Residu… Dystopia.Residu… Dystopia.Residu…
#  <chr>   <chr>  <chr>                  <dbl>            <dbl>            <dbl>
#1 Angola  Sub-S… Sub-Sahara…             1.95             2.09             1.61
#2 Bosnia… Centr… Central an…             2.06             2.48             1.89
#3 Congo … Sub-S… Sub-Sahara…             2.87             2.75             2.22
#4 Greece  Weste… Western Eu…             1.80             2.13             1.75
#5 Indone… South… Southeaste…             1.86             2.03             1.43
#6 Iraq    Middl… Middle Eas…             1.95             1.82             1.32
#7 Sierra… Sub-S… Sub-Sahara…             2.51             3.01             2.67
#8 Sudan   Sub-S… Sub-Sahara…             2.21             2.11             1.69
#9 Togo    Sub-S… Sub-Sahara…             1.57             2.14             1.84
## ... with 29 more variables: Economy.2015 <dbl>, Economy.2016 <dbl>,
##   Economy.2017 <dbl>, Family.2015 <dbl>, Family.2016 <dbl>,
##   Family.2017 <dbl>, Freedom.2015 <dbl>, Freedom.2016 <dbl>,
##   Freedom.2017 <dbl>, Generosity.2015 <dbl>, Generosity.2016 <dbl>,
##   Generosity.2017 <dbl>, Happiness.Rank.2015 <dbl>,
##   Happiness.Rank.2016 <dbl>, Happiness.Rank.2017 <dbl>,
##   Happiness.Score.2015 <dbl>, Happiness.Score.2016 <dbl>,
##   Happiness.Score.2017 <dbl>, Health.2015 <dbl>, Health.2016 <dbl>,
##   Health.2017 <dbl>, Lower.CI.2016 <dbl>, Standard.Error.2015 <dbl>,
##   Trust.2015 <dbl>, Trust.2016 <dbl>, Trust.2017 <dbl>, Upper.CI.2016 <dbl>,
##   Whisker.high.2017 <dbl>, whisker.low.2017 <dbl>

Explanation: 说明:

  1. Reshape data from wide to long; 从宽到长重塑数据; keep columns Country , Region and Region.2016 as they are. 保持CountryRegionRegion.2016Region.2016 All other column names are given in new column key with values in val . 所有其他列名称在新列key中给出,值在val
  2. Separate all key entries such as "Happiness.Score.2016" into "Happiness.Score" (column what ) and "2016" (column when`). 将所有key条目(例如"Happiness.Score.2016"分隔为"Happiness.Score" (column列为what ) and “ 2016” (column列为when)。
  3. Group entries by Country and what . Countrywhat分组条目。
  4. We can now replace NA s per Country and what by the mean value across all years. 现在,我们可以更换NA每一台S Countrywhat的所有年的平均值。
  5. Lastly, ungroup and remove the what and when columns before 最后, ungroup并删除之前的whatwhen
  6. reshaping data from long to wide again to be consistent with original data format. 再次将数据从长形更改为宽形,以与原始数据格式保持一致。

Mind you, it might actually be a lot easier (and more in line with "tidy" data) to keep your data in long format; 提醒您,实际上,将数据保留为长格式可能会容易得多(并且更符合“整洁”的数据)。 but that's just my opinion. 但那只是我的个人意见。


Let's check for Country == "Congo" 让我们检查Country == "Congo"

df.new %>% filter(str_detect(Country, "Congo")) %>% select(contains("Economy"))
## A tibble: 1 x 3
#  Economy.2015 Economy.2016 Economy.2017
#         <dbl>        <dbl>        <dbl>
#1       0.0744       0.0566       0.0921

and compare with the original data 并与原始数据进行比较

df %>% filter(str_detect(Country, "Congo")) %>% select(contains("Economy"))
#  Economy.2015 Economy.2016 Economy.2017
#1           NA      0.05661   0.09210235

So here 0.0744 = 1/2 * (0.05661 + 0.09210235) . 所以这里0.0744 = 1/2 * (0.05661 + 0.09210235)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM