简体   繁体   中英

Conditionally filling in missing values while reshaping a long to wide dataset in R

I am constructing complete timelines of indicators for a set of years and countries on the basis of multiple datasets with varying quality.

Using reshape2 I have "melted" those datasets into a single dataframe.

Example dataset:

d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE", 
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L, 
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5, 
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA, 
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"), 
    source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council", 
    "Eurostat", "OECD"), class = "factor")), .Names = c("cntry", 
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA, 
-19L))


d
#    cntry year indicator  sex   source
# 1     BE 1960       5.5 male Eurostat
# 2     BE 1970       1.2 male Eurostat
# 3     BE 1980       1.5 male Eurostat
# 4     DE 1960        NA male Eurostat
# 5     DE 1970       1.4 male Eurostat
# 6     GE 1960        NA male Eurostat
# 7     GE 1970        NA male Eurostat
# 8     BE 1960       5.5 male     OECD
# 9     BE 1970       1.2 male     OECD
# 10    DE 1960       2.3 male     OECD
# 11    DE 1970       1.4 male     OECD
# 12    GE 1960        NA male     OECD
# 13    GE 1970       1.4 male     OECD
# 14    BE 1960        NA male  Council
# 15    BE 1970        NA male  Council
# 16    DE 1960       2.3 male  Council
# 17    DE 1970       1.4 male  Council
# 18    GE 1970       1.4 male  Council
# 19    GE 1980        NA male  Council

I was hoping I could uses cast() with fun.aggregate to convert this long dataset into the wide format, while selecting the most high quality dataset (Eurostat > OECD > Council) for a given country-year combination to fill in the missings. Unfortunately I do not really understand how to work with such a custom aggregate function.

In other words, I want to reshape the dataset from a long to a wide format while merging multiple values depending on the value of a factor ("source"). Ideally it would work something as:

full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA  NA   NA
# GE 1960 NA  male Council 
# GE 1970 1.4 male OECD
# GE 1980 NA  male Council

and optionally (or directly) into the wide format:

# cntry  sex 1960 1970 1980
#    BE male  5.5  1.2  1.5
#    DE male  2.3  1.4  NA
#    GE male   NA  1.4  NA

Assuming that the data is in the order you require, that is, column source is ordered first by Eurostat , then by OECD and then by council , I'd go about using data.table in this manner:

require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator", 
 subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))

#    cntry  sex 1960 1970 1980
# 1:    BE male  5.5  1.2  1.5
# 2:    DE male  2.3  1.4   NA
# 3:    GE male   NA  1.4   NA

I am not sure if this meets all of your expectations, but it sounds like you're looking for something like the following:

toMerge <- expand.grid(cntry = c("BE", "DE", "GE"), 
                       year = c(1960, 1970, 1980), 
                       source = c("Eurostat", "OECD", "Council"), 
                       sex = "male")
d2 <- merge(d, toMerge, all = TRUE)

d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")], 
                 FUN = function(x) rank(x, ties.method="first", na.last=TRUE)))
D <- d2[Rank == 1, ]
D
#    cntry year  sex   source indicator
# 2     BE 1960 male Eurostat       5.5
# 5     BE 1970 male Eurostat       1.2
# 8     BE 1980 male Eurostat       1.5
# 14    DE 1970 male Eurostat       1.4
# 17    DE 1980 male Eurostat        NA
# 20    GE 1960 male Eurostat        NA
# 26    GE 1980 male Eurostat        NA
# 12    DE 1960 male     OECD       2.3
# 24    GE 1970 male     OECD       1.4

library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
#   cntry 1960 1970 1980
# 1    BE  5.5  1.2  1.5
# 2    DE  2.3  1.4   NA
# 3    GE   NA  1.4   NA

Perhaps the following could work as well:

library(reshape2)
x <- melt(d,id.vars=c("cntry","year","source","sex"))
y <- dcast(x,cntry+year+sex ~ source)
y$selected.value <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes=y$Council,no=y$OECD),no=y$Eurostat)
dcast(y,cntry + sex ~ year)

The source selection is made using a layered ifelse statement. The indication of the source selected is lost with this approach, if that is an issue, a similar ifelse statement can be added, creating the source origin variable:

y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")

Here is another option:

library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
  d2, cntry + sex ~ year, value.var="indicator", 
  fun.aggregate=function(x) {
    if(!length(x)) return(NA_real_)
    xs <- do.call(rbind, x)
    xs <- xs[complete.cases(xs), ]
    if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )

Produces:

  cntry  sex  1960  1970  1980
1    BE male 105.5 101.2 101.5
2    DE male   2.3 101.4    NA
3    GE male    NA   1.4    NA

Note I added 100 to "Eurostat" value to make them distinguishable from the others since in this sample set they seemed to be equal.

Basically, we cheat by turning the indicator column into a column of list items containing both the indicator and the source, and then we use fun.aggregate to pick the item from each group with the lowest source value (note we reset the factors so the most desirable source has the lowest level).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM