I have a data frame that I extracted from an html file of a Wikipedia page table. I want to replace the missing values with the median of each variable.
From the hints given, I know that I need to convert the factor type to numeric values, and I likely need to use as.numeric(gsub()).
renew$Hydro[grep('\\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))
I tried using grep() to show that '\\s' is the pattern for matching whitespace, but the blank entries were actually excluded from the output and only digits were shown.
When I tried using as.numeric(gsub()), the output looked like:
[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01
which does not at all resemble the data frame which looks like:
[1] 1035.3 7782 72 7109 30134.8 2351.2 15318
I expected the output to look exactly like the original data frame but with the spaces filled in with the column medians.
Edit: This is what the beginning of the data frame looks like. It's from "https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources".
> renew
               Country   Hydro  Wind    Bio Solar
1          Afghanistan  1035.3   0.1         35.5
2              Albania    7782                1.9
3              Algeria      72  19.4        339.1
4               Angola    7109          155  18.3
5             Anguilla                        2.4
6  Antigua and Barbuda                        5.5
7            Argentina 30134.8 554.1 1820.4  14.5
8              Armenia  2351.2   1.8          1.2
9                Aruba         130.3    8.9   9.2
10           Australia   15318 12199   3722  6209
11             Austria   42919  5235   4603  1096
12          Azerbaijan  1959.3  22.8  174.5  35.3
13             Bahamas                        1.9
14             Bahrain           1.2          8.3
15          Bangladesh     946   5.1    7.7 224.3
Because your data frame contains empty strings, those columns are stored as character, and taking the median of a character column is meaningless. We can first replace the empty strings with NA, convert the columns to numeric, and then replace the NAs with the median of each column. Using dplyr, we could do the following:
library(dplyr)
renew[renew == ""] <- NA
renew %>%
mutate_at(-1, as.numeric) %>% #-1 is to ignore Country column
mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))
# Country Hydro Wind Bio Solar
#1 Afghanistan 1035.3 0.1 174.5 35.5
#2 Albania 7782.0 21.1 174.5 1.9
#3 Algeria 72.0 19.4 174.5 339.1
#4 Angola 7109.0 21.1 155.0 18.3
#5 Anguilla 4730.1 21.1 174.5 2.4
#6 Antigua and Barbuda 4730.1 21.1 174.5 5.5
#7 Argentina 30134.8 554.1 1820.4 14.5
#8 Armenia 2351.2 1.8 174.5 1.2
#9 Aruba 4730.1 130.3 8.9 9.2
#10 Australia 15318.0 12199.0 3722.0 6209.0
#11 Austria 42919.0 5235.0 4603.0 1096.0
#12 Azerbaijan 1959.3 22.8 174.5 35.3
#13 Bahamas 4730.1 21.1 174.5 1.9
#14 Bahrain 4730.1 1.2 174.5 8.3
#15 Bangladesh 946.0 5.1 7.7 224.3
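As a side note on why the attempt in the question produced such odd numbers, here is a small sketch on made-up values. It assumes, as the question states, that the column was a factor; the vector f and the replacement string "54" are invented for illustration. Two things go wrong at once: as.numeric() on a factor returns the internal level codes rather than the printed values, and an empty pattern in gsub() matches before every character, so the replacement is spliced into each number instead of filling the blank cells.

```r
# Made-up factor column with one blank entry
f <- factor(c("10", "5", ""))

# as.numeric() on a factor returns the integer level codes, not the labels
codes  <- as.numeric(f)                 # 2 3 1 (levels sort as "", "10", "5")
values <- as.numeric(as.character(f))   # 10 5 NA

# An empty pattern matches before every character, so "54" is inserted
# in front of each digit instead of replacing blank cells
spliced <- gsub("", "54", "72")         # "547542"

print(codes)
print(values)
print(spliced)
```

This is why the steps above turn the blanks into NA and convert the columns to numeric before computing any median.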
We could do the same using base R:
renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x)
as.numeric(replace(x, is.na(x), median(as.numeric(x), na.rm = TRUE))))
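If the columns arrive as factors, as the question describes, the conversion needs to go through as.character() first so that the labels, not the level codes, are converted. A minimal sketch of a base R helper that guards against this is shown below; the name impute_median and the small demo data frame are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): convert a character or
# factor column to numeric, then fill missing entries with the column median.
impute_median <- function(x) {
  x <- as.character(x)                 # factor -> labels, not integer codes
  x[which(trimws(x) == "")] <- NA      # blank cells become NA explicitly
  x <- as.numeric(x)
  replace(x, is.na(x), median(x, na.rm = TRUE))
}

# Made-up demo data; the first column is skipped, as in the code above.
demo <- data.frame(Country = c("A", "B", "C", "D"),
                   Hydro   = factor(c("10", "", "30", "20")),
                   Wind    = c("1.5", "2.5", "", ""),
                   stringsAsFactors = FALSE)
demo[-1] <- lapply(demo[-1], impute_median)
print(demo)
```

With the real data, the same call would presumably be renew[-1] <- lapply(renew[-1], impute_median).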
We could do this in a compact way with na.aggregate from zoo, using retype from hablar to convert the column types:
library(dplyr)
library(hablar)
library(zoo)
renew %>%
retype %>% # change the type of columns
# replace missing value of numeric columns with median
mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
# Country Hydro Wind Bio Solar
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Afghanistan 1035. 0.1 174. 35.5
# 2 Albania 7782 21.1 174. 1.9
# 3 Algeria 72 19.4 174. 339.
# 4 Angola 7109 21.1 155 18.3
# 5 Anguilla 4730. 21.1 174. 2.4
# 6 Antigua and Barbuda 4730. 21.1 174. 5.5
# 7 Argentina 30135. 554. 1820. 14.5
# 8 Armenia 2351. 1.8 174. 1.2
# 9 Aruba 4730. 130. 8.9 9.2
#10 Australia 15318 12199 3722 6209
#11 Austria 42919 5235 4603 1096
#12 Azerbaijan 1959. 22.8 174. 35.3
#13 Bahamas 4730. 21.1 174. 1.9
#14 Bahrain 4730. 1.2 174. 8.3
#15 Bangladesh 946 5.1 7.7 224.
Data
renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria",
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia",
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain",
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "",
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "",
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1",
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("",
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5",
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5,
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"), class = "data.frame")
I'd like to note that the data isn't clean right after scraping, since lapply(renew, function(x) grep(",", x)) returns matches. Clean it first with gsub to avoid these values being converted to NAs when you convert the data to numeric. Here is a one-step solution; the correct NAs for the empty cells are created automatically:
renew[-1] <- lapply(renew[-1], function(x) as.numeric(gsub(",", ".", x)))
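To illustrate why the commas matter, here is a small sketch on a made-up vector (the values are hypothetical, not taken from the scraped table). as.numeric() cannot parse a string containing a comma and returns NA with a warning, and genuinely empty cells also become NA, so the two cases are indistinguishable unless the commas are handled first. Note that the replacement treats the comma as a decimal mark, as in the line above; if the commas in your data are thousands separators, you would remove them instead.

```r
# Made-up values: one with a comma, one clean, one blank
x <- c("12,5", "7", "")

raw     <- suppressWarnings(as.numeric(x))   # NA 7 NA: "12,5" is lost
cleaned <- as.numeric(gsub(",", ".", x))     # 12.5 7 NA: only the blank is NA

# Positions that became NA only because of a comma
lost <- which(is.na(raw) & !is.na(cleaned))  # 1

print(raw)
print(cleaned)
print(lost)
```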
After that, you could run sapply:
# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))
or, of course, a shorter adaptation of @Ronak Shah's second base R code line, which is a lot better:
renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))
Result
summary(renew)
# country hydro wind bio solar
# Afghanistan : 1 Min. : 0.8 Min. : 0.00 Min. : 0.2 Min. : 0.1
# Albania : 1 1st Qu.: 907.8 1st Qu.: 50.45 1st Qu.: 151.1 1st Qu.: 4.8
# Algeria : 1 Median : 2595.0 Median : 109.00 Median : 242.5 Median : 22.3
# Angola : 1 Mean : 19989.3 Mean : 4324.13 Mean : 2136.3 Mean : 1483.3
# Anguilla : 1 3rd Qu.: 7992.4 3rd Qu.: 293.55 3rd Qu.: 344.4 3rd Qu.: 124.5
# Antigua and Barbuda: 1 Max. :1193370.0 Max. :242387.70 Max. :69017.0 Max. :67874.1
# (Other) :209
Data
library(rvest)
renew <- setNames(html_table(
read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
"_by_electricity_production_from_renewable_sources")),
fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)