Conditionally filling in missing values while reshaping a long to wide dataset in R

I am constructing complete timelines of indicators for a set of years and countries on the basis of multiple datasets of varying quality.

Using reshape2, I have "melted" those datasets into a single data frame.

Example dataset:

d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE", 
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L, 
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5, 
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA, 
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"), 
    source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council", 
    "Eurostat", "OECD"), class = "factor")), .Names = c("cntry", 
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA, 
-19L))


d
#    cntry year indicator  sex   source
# 1     BE 1960       5.5 male Eurostat
# 2     BE 1970       1.2 male Eurostat
# 3     BE 1980       1.5 male Eurostat
# 4     DE 1960        NA male Eurostat
# 5     DE 1970       1.4 male Eurostat
# 6     GE 1960        NA male Eurostat
# 7     GE 1970        NA male Eurostat
# 8     BE 1960       5.5 male     OECD
# 9     BE 1970       1.2 male     OECD
# 10    DE 1960       2.3 male     OECD
# 11    DE 1970       1.4 male     OECD
# 12    GE 1960        NA male     OECD
# 13    GE 1970       1.4 male     OECD
# 14    BE 1960        NA male  Council
# 15    BE 1970        NA male  Council
# 16    DE 1960       2.3 male  Council
# 17    DE 1970       1.4 male  Council
# 18    GE 1970       1.4 male  Council
# 19    GE 1980        NA male  Council

I was hoping I could use cast() with fun.aggregate to convert this long dataset into the wide format, while selecting the highest-quality dataset (Eurostat > OECD > Council) for a given country-year combination to fill in the missing values. Unfortunately, I do not really understand how to work with such a custom aggregate function.

In other words, I want to reshape the dataset from long to wide format while merging multiple values depending on the value of a factor ("source"). Ideally it would work something like this:

full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA  NA   NA
# GE 1960 NA  male Council 
# GE 1970 1.4 male OECD
# GE 1980 NA  male Council

and optionally (or directly) into the wide format:

# cntry  sex 1960 1970 1980
#    BE male  5.5  1.2  1.5
#    DE male  2.3  1.4  NA
#    GE male   NA  1.4  NA

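Assuming that the data is in the order you require, that is, the source column is ordered first by Eurostat, then by OECD, and then by Council, I'd use data.table in this manner. Note that this relies on the sources agreeing wherever they overlap, as they do in the sample: if two sources reported different non-missing values for the same country and year, both rows would pass the subset and the cast would no longer have a single value per cell.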

require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator", 
 subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))

#    cntry  sex 1960 1970 1980
# 1:    BE male  5.5  1.2  1.5
# 2:    DE male  2.3  1.4   NA
# 3:    GE male   NA  1.4   NA
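
If the sources can disagree, one possible variant is to make the preference order explicit instead of relying on row order and duplicate values. The following is only a sketch under assumptions: the small table `dt`, its values, and the names `pref_order`, `best` and `wide` are made up for illustration and are not part of the original answer.

```r
library(data.table)

# Hypothetical long data in which the sources disagree (values are made up)
dt <- data.table(
  cntry     = c("BE", "BE", "DE", "DE", "GE", "GE"),
  year      = c(1960L, 1960L, 1960L, 1960L, 1970L, 1970L),
  sex       = "male",
  source    = c("OECD", "Eurostat", "Council", "OECD", "Eurostat", "Council"),
  indicator = c(9.9, 5.5, 2.3, 2.4, NA, 1.4)
)

# Encode the preference as factor levels: the first level is the most trusted
pref_order <- c("Eurostat", "OECD", "Council")
dt[, source := factor(source, levels = pref_order)]

# Drop missing values, sort by preference, keep the first row per group
best <- dt[!is.na(indicator)][order(source), .SD[1L], by = .(cntry, sex, year)]

# Each cntry/sex/year cell now has at most one value, so no aggregation is needed
wide <- dcast(best, cntry + sex ~ year, value.var = "indicator")
wide
```

Because `best` keeps the `source` column, it also records which dataset each selected value came from.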

I am not sure if this meets all of your expectations, but it sounds like you're looking for something like the following:

toMerge <- expand.grid(cntry = c("BE", "DE", "GE"), 
                       year = c(1960, 1970, 1980), 
                       source = c("Eurostat", "OECD", "Council"), 
                       sex = "male")
d2 <- merge(d, toMerge, all = TRUE)

d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
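# Rank rows within each cntry/year/sex group by missingness, not by value:
# rows are already sorted by source preference, so rank 1 is the first non-NA
# row, or the first row when every source is NA.
Rank <- with(d2, ave(as.numeric(is.na(indicator)), d2[c("cntry", "year", "sex")], 
                 FUN = function(x) rank(x, ties.method="first")))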
D <- d2[Rank == 1, ]
D
#    cntry year  sex   source indicator
# 2     BE 1960 male Eurostat       5.5
# 5     BE 1970 male Eurostat       1.2
# 8     BE 1980 male Eurostat       1.5
# 14    DE 1970 male Eurostat       1.4
# 17    DE 1980 male Eurostat        NA
# 20    GE 1960 male Eurostat        NA
# 26    GE 1980 male Eurostat        NA
# 12    DE 1960 male     OECD       2.3
# 24    GE 1970 male     OECD       1.4

library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
#   cntry 1960 1970 1980
# 1    BE  5.5  1.2  1.5
# 2    DE  2.3  1.4   NA
# 3    GE   NA  1.4   NA

Perhaps the following could work as well:

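library(reshape2)
x <- melt(d, id.vars = c("cntry", "year", "source", "sex"))
# Spread the sources into columns; melt() named the value column "value"
y <- dcast(x, cntry + year + sex ~ source, value.var = "value")
# Take Eurostat if present, otherwise OECD, otherwise Council
y$selected.value <- ifelse(is.na(y$Eurostat), yes = ifelse(is.na(y$OECD), yes = y$Council, no = y$OECD), no = y$Eurostat)
dcast(y, cntry + sex ~ year, value.var = "selected.value")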

The source selection is made using a nested ifelse statement. The indication of which source was selected is lost with this approach. If that is an issue, a similar ifelse statement can be added to create a variable recording the source of origin:

y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")
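
With more than three sources the nested ifelse calls become hard to read. As a sketch only, the same selection can be written for any number of sources by folding over the columns in preference order. The table `y` below is a small hypothetical stand-in with the same shape as the one above, and `pref_order` and `first_non_na` are illustrative names, not part of the original answer.

```r
# Hypothetical wide-by-source table, shaped like `y` above (values are made up)
y <- data.frame(
  cntry    = c("BE", "DE", "GE", "GE"),
  year     = c(1960, 1960, 1970, 1980),
  Eurostat = c(5.5, NA, NA, NA),
  OECD     = c(5.5, 2.3, 1.4, NA),
  Council  = c(NA, 2.3, 1.4, NA)
)

# Most trusted source first
pref_order <- c("Eurostat", "OECD", "Council")

# Keep `a` where it is available, otherwise fall back to `b`
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Fold over the source columns from most to least preferred
y$selected.value <- Reduce(first_non_na, y[pref_order])

# Record the origin: walk from least to most preferred so that
# better sources overwrite worse ones
y$selected.source <- NA_character_
for (src in rev(pref_order)) {
  y$selected.source[!is.na(y[[src]])] <- src
}
y
```

Rows where every source is missing keep NA in both new columns.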

Here is another option:

library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
  d2, cntry + sex ~ year, value.var="indicator", 
  fun.aggregate=function(x) {
    if(!length(x)) return(NA_real_)
    xs <- do.call(rbind, x)
    xs <- xs[complete.cases(xs), ]
    if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )

Produces:

  cntry  sex  1960  1970  1980
1    BE male 105.5 101.2 101.5
2    DE male   2.3 101.4    NA
3    GE male    NA   1.4    NA

Note that I added 100 to the "Eurostat" values to make them distinguishable from the others, since in this sample set they appeared to be equal.

Basically, we cheat by turning the indicator column into a list column whose items contain both the indicator and the source, and then we use fun.aggregate to pick, from each group, the item with the lowest source level (note that we reset the factor levels so the most desirable source has the lowest level).
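
To make the selection step easier to follow, here is the same logic pulled out into a standalone function and run on one hand-built group. This is a sketch: `pick_preferred`, `pref_order` and the sample `cells` are hypothetical names and values used only for illustration.

```r
# Most trusted source gets the lowest factor level
pref_order <- c("Eurostat", "OECD", "Council")

# `cells` is a list of one-row data frames, each holding an indicator
# together with the source it came from
pick_preferred <- function(cells) {
  if (!length(cells)) return(NA_real_)           # empty cell in the wide table
  xs <- do.call(rbind, cells)                    # stack into one data frame
  xs <- xs[complete.cases(xs), , drop = FALSE]   # drop rows with a missing indicator
  if (!nrow(xs)) return(NA_real_)                # every source was missing
  xs$indicator[order(as.integer(xs$source))][1L] # value from the best source left
}

# One made-up country-year group: Eurostat is missing, OECD and Council disagree
cells <- list(
  data.frame(indicator = 2.3,      source = factor("Council",  levels = pref_order)),
  data.frame(indicator = NA_real_, source = factor("Eurostat", levels = pref_order)),
  data.frame(indicator = 2.4,      source = factor("OECD",     levels = pref_order))
)
pick_preferred(cells)
```

Here the OECD value is returned, because Eurostat is missing and OECD has a lower factor level than Council.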
