
How to replace the missing values with the median for the variable using gsub in R?

I have a data frame extracted from the HTML of a table on a Wikipedia page. I want to replace the missing values with the median of each variable.

From the hints I was given, I know I need to convert the factor type to numeric, and that I probably need to use as.numeric(gsub()).

renew$Hydro[grep('\\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))

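I tried using grep() with '\\s' as the pattern to pick out the blanks, but the blanks were actually excluded from the output and only the numbers were shown.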

When I tried as.numeric(gsub()), the output looked like this:

[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01

That looks nothing like the column in the data frame, which looks like this:

[1] 1035.3   7782     72       7109                       30134.8  2351.2            15318   

I want the output to look exactly like the original data frame, but with the blanks filled with the column medians.

Edit: This is what the beginning of the data frame looks like. It comes from "https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources".

> renew
                             Country    Hydro     Wind     Bio   Solar
1                        Afghanistan   1035.3      0.1            35.5
2                            Albania     7782                      1.9
3                            Algeria       72     19.4           339.1
4                             Angola     7109              155    18.3
5                           Anguilla                               2.4
6                Antigua and Barbuda                               5.5
7                          Argentina  30134.8    554.1  1820.4    14.5
8                            Armenia   2351.2      1.8             1.2
9                              Aruba             130.3     8.9     9.2
10                         Australia    15318    12199    3722    6209
11                           Austria    42919     5235    4603    1096
12                        Azerbaijan   1959.3     22.8   174.5    35.3
13                           Bahamas                               1.9
14                           Bahrain               1.2             8.3
15                        Bangladesh      946      5.1     7.7   224.3

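Because your data frame contains blanks, the columns are read as character, and taking the median of a character column does not make sense. We can first replace the blanks with NA, convert the columns to numeric, and then replace each NA with the median of its column. Using dplyr, we can do the following steps.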

library(dplyr)
renew[renew == ""] <- NA

renew %>%
   mutate_at(-1, as.numeric) %>% #-1 is to ignore Country column
   mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))


#               Country   Hydro    Wind    Bio  Solar
#1          Afghanistan  1035.3     0.1  174.5   35.5
#2              Albania  7782.0    21.1  174.5    1.9
#3              Algeria    72.0    19.4  174.5  339.1
#4               Angola  7109.0    21.1  155.0   18.3
#5             Anguilla  4730.1    21.1  174.5    2.4
#6  Antigua and Barbuda  4730.1    21.1  174.5    5.5
#7            Argentina 30134.8   554.1 1820.4   14.5
#8              Armenia  2351.2     1.8  174.5    1.2
#9                Aruba  4730.1   130.3    8.9    9.2
#10           Australia 15318.0 12199.0 3722.0 6209.0
#11             Austria 42919.0  5235.0 4603.0 1096.0
#12          Azerbaijan  1959.3    22.8  174.5   35.3
#13             Bahamas  4730.1    21.1  174.5    1.9
#14             Bahrain  4730.1     1.2  174.5    8.3
#15          Bangladesh   946.0     5.1    7.7  224.3
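
In dplyr 1.0.0 and later, mutate_at() is superseded by across(). The following is a minimal sketch of the same two steps in that style, assuming a small made-up data frame (toy and its values are hypothetical, not the Wikipedia data):

```r
library(dplyr)

# Hypothetical toy data: empty strings stand in for the missing cells
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Hydro   = c("10", "", "30", "50"),
  Wind    = c("", "2", "4", ""),
  stringsAsFactors = FALSE
)

filled <- toy %>%
  # turn "" into NA and convert every column except Country to numeric
  mutate(across(-Country, ~ as.numeric(na_if(.x, "")))) %>%
  # replace each NA with the median of the non-missing values in its column
  mutate(across(-Country, ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))))

filled
```

Selecting with -Country instead of -1 is only a readability choice; both exclude the first column here.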

We can do the same thing in base R:

renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x) 
      as.numeric(replace(x, is.na(x), median(as.numeric(x), na.rm = TRUE))))
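
As for why the as.numeric(gsub('', ...)) attempt in the question returns such strange numbers: an empty pattern matches at every character position, so gsub() splices the replacement in front of each character rather than filling the blank cells. A small sketch with made-up values (the 54 is only a stand-in for whatever median() returned on the factor codes):

```r
# An empty pattern matches before every character of the input
gsub("", "X", "abc")
# "XaXbXc"

# So a numeric replacement gets spliced in between the digits
spliced <- gsub("", 54, "72")
spliced
as.numeric(spliced)

# Matching whole blank cells instead: ^\\s*$ means "empty or whitespace only"
x <- c("1035.3", "", "72", " ")
x[grepl("^\\s*$", x)] <- NA
as.numeric(x)
```

Setting the blank cells to NA first, as in the code above, sidesteps the problem because no pattern replacement inside the numbers is needed.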

We can do this compactly with na.aggregate from zoo:

library(dplyr)
library(hablar)
library(zoo)
renew %>%
    retype %>% # change the type of columns
    # replace missing value of numeric columns with median
     mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
#   Country              Hydro    Wind    Bio  Solar
#   <chr>                <dbl>   <dbl>  <dbl>  <dbl>
# 1 Afghanistan          1035.     0.1  174.    35.5
# 2 Albania              7782     21.1  174.     1.9
# 3 Algeria                72     19.4  174.   339. 
# 4 Angola               7109     21.1  155     18.3
# 5 Anguilla             4730.    21.1  174.     2.4
# 6 Antigua and Barbuda  4730.    21.1  174.     5.5
# 7 Argentina           30135.   554.  1820.    14.5
# 8 Armenia              2351.     1.8  174.     1.2
# 9 Aruba                4730.   130.     8.9    9.2
#10 Australia           15318  12199   3722   6209  
#11 Austria             42919   5235   4603   1096  
#12 Azerbaijan           1959.    22.8  174.    35.3
#13 Bahamas              4730.    21.1  174.     1.9
#14 Bahrain              4730.     1.2  174.     8.3
#15 Bangladesh            946      5.1    7.7  224. 
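
To see what na.aggregate does in isolation, here is a minimal sketch on a plain numeric vector (the vector x is made up for illustration):

```r
library(zoo)

# Hypothetical numeric vector with one missing value
x <- c(1, NA, 3, 10)

# FUN is applied to the non-missing values; the result fills the NA positions
filled <- na.aggregate(x, FUN = median)
filled
```

The default FUN is mean, so FUN = median has to be passed explicitly to get median imputation.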

data

renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria", 
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia", 
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain", 
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "", 
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "", 
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1", 
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("", 
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5", 
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5, 
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"), class = "data.frame")

I'd like to point out that the data is not yet clean after scraping, since lapply(renew, function(x) grep(",", x)) yields something.

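First clean it up using gsub, so that these values are not turned into NA when the data is converted to numeric. This is a one-step solution that creates the right NAs automatically.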

renew[-1] <- lapply(renew[-1], function(x) as.numeric(as.character(gsub(",", ".", x))))
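
Whether a comma should become a decimal point or simply be dropped depends on what it means in the scraped cells, so it is worth inspecting them before converting. A sketch on made-up strings (the vector x is hypothetical, not taken from the table) comparing both treatments:

```r
x <- c("1,193,370", "12,5", "72", "")

# Inspect which entries contain a comma at all
x[grepl(",", x, fixed = TRUE)]

# Treat commas as thousands separators: drop them
as_thousands <- as.numeric(gsub(",", "", x, fixed = TRUE))

# Treat commas as decimal marks: swap them for "."
# ("1.193.370" is not a valid number, so it becomes NA with a warning)
as_decimal <- suppressWarnings(as.numeric(gsub(",", ".", x, fixed = TRUE)))

as_thousands
as_decimal
```

In both variants the empty string becomes NA without any extra step, which is the "automatic" NA creation mentioned above.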

After that you could run an sapply:

# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))

Or, of course, a shorter adaptation of @Ronak Shah's second base R code line, which is much nicer:

renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))
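
sapply() works here because every column has the same length, so the result simplifies to a numeric matrix that can be assigned back into the columns. If you prefer to keep one vector per column, lapply() is a possible alternative. A hedged sketch with a hypothetical helper fill_median() and a made-up data frame:

```r
# Made-up data frame with NAs in the numeric columns
df <- data.frame(id = c("a", "b", "c"), u = c(1, NA, 3), v = c(NA, 4, 8))

# Hypothetical helper: replace NAs with the median of the non-missing values
fill_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))

# sapply() simplifies the equal-length results to a matrix
is.matrix(sapply(df[-1], fill_median))

# lapply() returns a list with one vector per column
df[-1] <- lapply(df[-1], fill_median)
df
```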

Result

summary(renew)
#                      country        hydro                wind                bio              solar        
# Afghanistan        :  1   Min.   :      0.8   Min.   :     0.00   Min.   :    0.2   Min.   :    0.1  
# Albania            :  1   1st Qu.:    907.8   1st Qu.:    50.45   1st Qu.:  151.1   1st Qu.:    4.8  
# Algeria            :  1   Median :   2595.0   Median :   109.00   Median :  242.5   Median :   22.3  
# Angola             :  1   Mean   :  19989.3   Mean   :  4324.13   Mean   : 2136.3   Mean   : 1483.3  
# Anguilla           :  1   3rd Qu.:   7992.4   3rd Qu.:   293.55   3rd Qu.:  344.4   3rd Qu.:  124.5  
# Antigua and Barbuda:  1   Max.   :1193370.0   Max.   :242387.70   Max.   :69017.0   Max.   :67874.1  
# (Other)            :209                                                                              

Data

library(rvest)
renew <- setNames(html_table(
  read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
                   "_by_electricity_production_from_renewable_sources")),
  fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)
