How to replace the missing values with the median for the variable using gsub in R?
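I have a data frame extracted from the HTML of a table on a Wikipedia page. I want to replace the missing values with the median of each variable.
From the hints I was given, I know I need to convert the factor type to numeric, and that I probably need to use as.numeric(gsub()).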
renew$Hydro[grep('\\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))
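I tried using grep() with '\\s' as the pattern to pick out the blanks, but the blanks were left out of the output and only the numbers were shown.
When I tried as.numeric(gsub()), the output looked like this: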
[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01
That looks nothing like the values in the data frame:
[1] 1035.3 7782 72 7109 30134.8 2351.2 15318
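I want the output to look exactly like the original data frame, but with the blanks filled by the column median.
Edit: this is what the beginning of the data frame looks like. It comes from "https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources".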
> renew
               Country   Hydro    Wind    Bio  Solar
1          Afghanistan  1035.3     0.1          35.5
2              Albania    7782                   1.9
3              Algeria      72    19.4         339.1
4               Angola    7109            155   18.3
5             Anguilla                           2.4
6  Antigua and Barbuda                           5.5
7            Argentina 30134.8   554.1 1820.4   14.5
8              Armenia  2351.2     1.8           1.2
9                Aruba           130.3    8.9    9.2
10           Australia   15318   12199   3722   6209
11             Austria   42919    5235   4603   1096
12          Azerbaijan  1959.3    22.8  174.5   35.3
13             Bahamas                           1.9
14             Bahrain             1.2           8.3
15          Bangladesh     946     5.1    7.7  224.3
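Because your data frame contains blank cells, the columns are stored as character, and taking the median of a character column is not meaningful. We can first replace the blanks with NA, convert the columns to numeric, and then replace each NA with the column median. With dplyr, this takes the following steps.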
library(dplyr)
renew[renew == ""] <- NA
renew %>%
  mutate_at(-1, as.numeric) %>% # -1 skips the Country column
  mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))
#               Country   Hydro    Wind    Bio  Solar
#1          Afghanistan  1035.3     0.1  174.5   35.5
#2              Albania  7782.0    21.1  174.5    1.9
#3              Algeria    72.0    19.4  174.5  339.1
#4               Angola  7109.0    21.1  155.0   18.3
#5             Anguilla  4730.1    21.1  174.5    2.4
#6  Antigua and Barbuda  4730.1    21.1  174.5    5.5
#7            Argentina 30134.8   554.1 1820.4   14.5
#8              Armenia  2351.2     1.8  174.5    1.2
#9                Aruba  4730.1   130.3    8.9    9.2
#10           Australia 15318.0 12199.0 3722.0 6209.0
#11             Austria 42919.0  5235.0 4603.0 1096.0
#12          Azerbaijan  1959.3    22.8  174.5   35.3
#13             Bahamas  4730.1    21.1  174.5    1.9
#14             Bahrain  4730.1     1.2  174.5    8.3
#15          Bangladesh   946.0     5.1    7.7  224.3
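As a side note on why the gsub attempt in the question returned such large numbers: an empty pattern matches at every position in a string, so the replacement is inserted in front of each character rather than only where a cell is blank. The sketch below reproduces the effect on a made-up vector x. The stand-in median of 54 is inferred from the 5.400000e+01 entries in the question's output, so treat it as an assumption.

```r
# Toy character vector with one blank entry (hypothetical data).
x <- c("1035.3", "72", "")

# An empty pattern matches before every character, so "54" is
# inserted in front of each character of each string.
garbled <- gsub("", "54", x)
print(garbled)

# Selecting only empty or whitespace-only strings leaves the
# real values untouched.
fixed <- x
fixed[grepl("^\\s*$", fixed)] <- "54"
print(fixed)
```

This is why the blanks have to be turned into NA first: gsub rewrites text inside every string and cannot compute a median on its own.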
The same replacement can be done in base R:
renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x)
  as.numeric(replace(x, is.na(x), median(as.numeric(x), na.rm = TRUE))))
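The base R steps can also be wrapped in a small function. The following is a minimal sketch under a few assumptions: impute_median is a hypothetical helper name, the toy data frame is made up, and whitespace-only cells are treated as blanks in addition to empty strings.

```r
# Hypothetical helper: convert a column to numeric and fill the
# blank cells with the median of the remaining values.
impute_median <- function(x) {
  x <- trimws(as.character(x))   # as.character() also covers factors
  x[x == ""] <- NA               # blank cells become NA
  x <- as.numeric(x)
  replace(x, is.na(x), median(x, na.rm = TRUE))
}

# Small made-up data frame shaped like the one in the question.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Hydro   = c("10", "", "30", " "),
  Wind    = c("", "2.5", "", "7.5"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column except the first.
toy[-1] <- lapply(toy[-1], impute_median)
print(toy)
```

Converting through as.character() first matters for factor columns, because as.numeric() on a factor returns the internal level codes instead of the displayed values.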
We can also use na.aggregate from zoo to do this compactly:
library(dplyr)
library(hablar)
library(zoo)
renew %>%
  retype %>% # change the type of columns
  # replace missing values of numeric columns with the median
  mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
#   Country              Hydro    Wind    Bio  Solar
#   <chr>                <dbl>   <dbl>  <dbl>  <dbl>
# 1 Afghanistan          1035.     0.1   174.   35.5
# 2 Albania              7782     21.1   174.    1.9
# 3 Algeria                72     19.4   174.  339.
# 4 Angola               7109     21.1   155    18.3
# 5 Anguilla             4730.    21.1   174.    2.4
# 6 Antigua and Barbuda  4730.    21.1   174.    5.5
# 7 Argentina           30135.   554.   1820.   14.5
# 8 Armenia              2351.     1.8   174.    1.2
# 9 Aruba                4730.   130.      8.9   9.2
#10 Australia           15318  12199    3722  6209
#11 Austria             42919   5235    4603  1096
#12 Azerbaijan           1959.    22.8   174.   35.3
#13 Bahamas              4730.    21.1   174.    1.9
#14 Bahrain              4730.     1.2   174.    8.3
#15 Bangladesh            946      5.1     7.7 224.
renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria",
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia",
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain",
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "",
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "",
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1",
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("",
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5",
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5,
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"), class = "data.frame")
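I would like to point out that the data are not yet clean after scraping, because lapply(renew, function(x) grep(",", x)) returns matches, which means some values contain commas.
Clean them with gsub first, so that these values are not turned into NA when the data are converted to numeric. Here is a one-step solution that also creates the correct NAs automatically: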
renew[-1] <- lapply(renew[-1], function(x) as.numeric(as.character(gsub(",", ".", x))))
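What a comma stands for depends on the source: it can be a decimal mark or a thousands separator. The line above treats it as a decimal mark. The sketch below contrasts the two readings on made-up strings. Which one applies to the scraped table is an assumption that should be checked against the page itself.

```r
# Made-up strings; the convention used by the real table is not
# known here and has to be verified.
x <- c("1,5", "12,199", "", "7.7")

# Comma as decimal mark: swap it for a dot before converting.
as_decimal <- as.numeric(gsub(",", ".", x, fixed = TRUE))

# Comma as thousands separator: drop it before converting.
as_thousands <- as.numeric(gsub(",", "", x, fixed = TRUE))

print(as_decimal)
print(as_thousands)
```

In both cases the empty string becomes NA during the conversion, which is what makes the later median replacement possible.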
After that you could run an sapply:
# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))
Or, of course, use a shorter adaptation of @Ronak Shah's second base R code line, which is much better:
renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))
Result
summary(renew)
#                 country          hydro                 wind                 bio                solar
# Afghanistan        :  1   Min.   :      0.8   Min.   :     0.00   Min.   :    0.2   Min.   :    0.1
# Albania            :  1   1st Qu.:    907.8   1st Qu.:    50.45   1st Qu.:  151.1   1st Qu.:    4.8
# Algeria            :  1   Median :   2595.0   Median :   109.00   Median :  242.5   Median :   22.3
# Angola             :  1   Mean   :  19989.3   Mean   :  4324.13   Mean   : 2136.3   Mean   : 1483.3
# Anguilla           :  1   3rd Qu.:   7992.4   3rd Qu.:   293.55   3rd Qu.:  344.4   3rd Qu.:  124.5
# Antigua and Barbuda:  1   Max.   :1193370.0   Max.   :242387.70   Max.   :69017.0   Max.   :67874.1
# (Other)            :209
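One detail about the sapply line above: sapply simplifies equal-length results into a matrix, which the assignment then spreads back over the columns. With lapply the result stays a list with one vector per column. A small sketch on a made-up data frame (toy and fill are illustrative names, not part of the original code):

```r
# Made-up data frame with NAs in the numeric columns.
toy <- data.frame(id = c("a", "b", "c"),
                  u  = c(1, NA, 3),
                  v  = c(NA, 4, 8))

# Replace NAs in a numeric vector with its median.
fill <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))

# sapply() simplifies the per-column results into a matrix.
m <- sapply(toy[-1], fill)
print(class(m))

# lapply() returns a list with one numeric vector per column.
toy[-1] <- lapply(toy[-1], fill)
print(toy)
```

Both forms give the same values when all columns are numeric, so the choice is mostly a matter of avoiding the intermediate matrix.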
Data
library(rvest)
renew <- setNames(html_table(
  read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
                   "_by_electricity_production_from_renewable_sources")),
  fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)