[英]Conditionally filling in missing values while reshaping a long to wide dataset in R
I am constructing complete timelines of indicators for a set of years and countries on the basis of multiple datasets with varying quality. 我正在基于多个质量不同的数据集构建一组年份和国家的完整指标时间表。
Using reshape2
I have "melted" those datasets into a single dataframe. 使用
reshape2
我将这些数据集“融化”为一个数据帧。
Example dataset: 示例数据集:
d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE",
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L,
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L,
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5,
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA,
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"),
source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council",
"Eurostat", "OECD"), class = "factor")), .Names = c("cntry",
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA,
-19L))
d
# cntry year indicator sex source
# 1 BE 1960 5.5 male Eurostat
# 2 BE 1970 1.2 male Eurostat
# 3 BE 1980 1.5 male Eurostat
# 4 DE 1960 NA male Eurostat
# 5 DE 1970 1.4 male Eurostat
# 6 GE 1960 NA male Eurostat
# 7 GE 1970 NA male Eurostat
# 8 BE 1960 5.5 male OECD
# 9 BE 1970 1.2 male OECD
# 10 DE 1960 2.3 male OECD
# 11 DE 1970 1.4 male OECD
# 12 GE 1960 NA male OECD
# 13 GE 1970 1.4 male OECD
# 14 BE 1960 NA male Council
# 15 BE 1970 NA male Council
# 16 DE 1960 2.3 male Council
# 17 DE 1970 1.4 male Council
# 18 GE 1970 1.4 male Council
# 19 GE 1980 NA male Council
I was hoping I could uses cast()
with fun.aggregate
to convert this long dataset into the wide format, while selecting the most high quality dataset (Eurostat > OECD > Council) for a given country-year combination to fill in the missings. 我希望我可以使用带有
fun.aggregate
cast()
将这个长数据集转换为宽格式,同时为给定的国家/年组合选择最高质量的数据集(Eurostat> OECD> Council)来填补缺失。 Unfortunately I do not really understand how to work with such a custom aggregate function. 不幸的是,我真的不明白如何使用这样的自定义聚合函数。
In other words, I want to reshape the dataset from a long to a wide format while merging multiple values depending on the value of a factor ("source"). 换句话说,我希望将数据集从长格式重新整形为宽格式,同时根据因子(“源”)的值合并多个值。 Ideally it would work something as:
理想情况下,它可以工作:
full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA NA NA
# GE 1960 NA male Council
# GE 1970 1.4 male OECD
# GE 1980 NA male Council
and optionally (or directly) into the wide format: 并且可选地(或直接)进入宽格式:
# cntry sex 1960 1970 1980
# BE male 5.5 1.2 1.5
# DE male 2.3 1.4 NA
# GE male NA 1.4 NA
Assuming that the data is in the order you require, that is, column source
is ordered first by Eurostat
, then by OECD
and then by council
, I'd go about using data.table
in this manner: 假设数据符合您的要求,即列
source
首先由Eurostat
订购,然后由OECD
订购,然后由council
订购,我data.table
这种方式使用data.table
:
require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator",
subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))
# cntry sex 1960 1970 1980
# 1: BE male 5.5 1.2 1.5
# 2: DE male 2.3 1.4 NA
# 3: GE male NA 1.4 NA
I am not sure if this meets all of your expectations, but it sounds like you're looking for something like the following: 我不确定这是否满足您的所有期望,但听起来您正在寻找以下内容:
toMerge <- expand.grid(cntry = c("BE", "DE", "GE"),
year = c(1960, 1970, 1980),
source = c("Eurostat", "OECD", "Council"),
sex = "male")
d2 <- merge(d, toMerge, all = TRUE)
d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")],
FUN = function(x) rank(x, ties.method="first", na.last=TRUE)))
D <- d2[Rank == 1, ]
D
# cntry year sex source indicator
# 2 BE 1960 male Eurostat 5.5
# 5 BE 1970 male Eurostat 1.2
# 8 BE 1980 male Eurostat 1.5
# 14 DE 1970 male Eurostat 1.4
# 17 DE 1980 male Eurostat NA
# 20 GE 1960 male Eurostat NA
# 26 GE 1980 male Eurostat NA
# 12 DE 1960 male OECD 2.3
# 24 GE 1970 male OECD 1.4
library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
# cntry 1960 1970 1980
# 1 BE 5.5 1.2 1.5
# 2 DE 2.3 1.4 NA
# 3 GE NA 1.4 NA
Perhaps the following could work as well: 也许以下内容也可以起作用:
library(reshape2)
x <- melt(d,id.vars=c("cntry","year","source","sex"))
y <- dcast(x,cntry+year+sex ~ source)
y$selected.value <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes=y$Council,no=y$OECD),no=y$Eurostat)
dcast(y,cntry + sex ~ year)
The source selection is made using a layered ifelse
statement. 使用分层
ifelse
语句进行源选择。 The indication of the source selected is lost with this approach, if that is an issue, a similar ifelse
statement can be added, creating the source origin variable: 使用此方法会丢失所选源的指示,如果这是一个问题,可以添加类似的
ifelse
语句,从而创建源origin变量:
y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")
Here is another option: 这是另一种选择:
library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
d2, cntry + sex ~ year, value.var="indicator",
fun.aggregate=function(x) {
if(!length(x)) return(NA_real_)
xs <- do.call(rbind, x)
xs <- xs[complete.cases(xs), ]
if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )
Produces: 生产:
cntry sex 1960 1970 1980
1 BE male 105.5 101.2 101.5
2 DE male 2.3 101.4 NA
3 GE male NA 1.4 NA
Note I added 100 to "Eurostat" value to make them distinguishable from the others since in this sample set they seemed to be equal. 注意我在“Eurostat”值中添加了100,以使它们与其他值区别开来,因为在此示例集中它们似乎相等。
Basically, we cheat by turning the indicator
column into a column of list items containing both the indicator and the source, and then we use fun.aggregate
to pick the item from each group with the lowest source value (note we reset the factors so the most desirable source has the lowest level). 基本上,我们通过将
indicator
列转换为包含指示器和源的列表项列来作弊,然后我们使用fun.aggregate
从具有最低源值的每个组中选择项目(注意我们重置因子以便最理想的来源具有最低水平)。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.