How can I aggregate multiple columns in a data.frame with a custom function in R?
I've got a data.frame dt with some duplicate keys and missing data, i.e.
Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA
In this case the key is the name, and I would like to apply to each column a function like
f <- function(x) {
  x <- x[!is.na(x)]
  x <- x[1]
  return(x)
}
while aggregating by the key (i.e., the "Name" column), so as to obtain as a result
Name Height Weight Age
Alice 180 70 35
Bob NA 80 27
Charles 170 75 NA
I tried
dt_agg <- aggregate(. ~ Name,
                    data = dt,
                    FUN = f)
and I got some errors, so I tried the following
dt_agg_1 <- aggregate(Height ~ Name,
                      data = dt,
                      FUN = f)
dt_agg_2 <- aggregate(Weight ~ Name,
                      data = dt,
                      FUN = f)
and this time it worked.
Since I have 50 columns, this second approach is quite cumbersome for me. Is there a way to fix the first approach?
Thanks for the help!
You were very close with the aggregate function; you just needed to adjust how aggregate handles NA (from na.omit to na.pass). My guess is that aggregate removes all rows containing an NA first and then does its aggregating, instead of removing NAs as it iterates over the columns to be aggregated. Since your example data frame has an NA in every row, you end up with a 0-row data frame (which is the error I was getting when running your code). I tested this by removing all but one NA, and your code works as-is. So we set na.action = na.pass to pass the NAs through.
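# Formula interface: keep rows that contain NA
dt_agg <- aggregate(. ~ Name,
                    data = dt,
                    FUN = f, na.action = "na.pass")
# Alternative: the non-formula interface does not drop rows with NA
# (this is the call that produces the Group.1 column shown below)
dt_agg <- aggregate(dt[, -1],
                    by = list(dt$Name),
                    FUN = f)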
dt_agg
# Group.1 Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
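The row-dropping explanation above can be checked directly in base R. This is only a sketch: the data.frame() call is my reconstruction of the table printed in the question (numeric columns are an assumption), and f is redefined as in the question. It shows that na.omit, the default na.action of aggregate's formula interface, leaves no rows to aggregate, while na.pass keeps all four.

```r
# Reconstruction of the question's table (assumed column types: numeric)
dt <- data.frame(
  Name   = c("Alice", "Bob", "Alice", "Charles"),
  Height = c(180, NA, NA, 170),
  Weight = c(NA, 80, 70, 75),
  Age    = c(35, 27, NA, NA)
)

# f from the question: first non-missing value (NA if there is none)
f <- function(x) {
  x <- x[!is.na(x)]
  x[1]
}

# Every row has at least one NA, so na.omit() removes all of them
rows_after_omit <- nrow(na.omit(dt))
# na.pass() returns the data unchanged
rows_after_pass <- nrow(na.pass(dt))

# With na.pass the formula interface sees every row and f deals with the NAs
dt_agg <- aggregate(. ~ Name, data = dt, FUN = f, na.action = na.pass)
print(dt_agg)
```

Because the `.` on the left-hand side expands to every column other than Name, the same call should carry over to a wider data frame without listing the columns by hand.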
You can do this with dplyr:
library(dplyr)
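df %>%
  group_by(Name) %>%
  # sort() drops NAs, so this keeps the smallest non-missing value per group;
  # the ~ lambda replaces funs(), which is deprecated since dplyr 0.8.0
  summarize_all(~ sort(.)[1])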
Result:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <int> <int> <int>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
Data:
df = read.table(text = "Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA", header = TRUE)
Here is an option with data.table:
library(data.table)
setDT(df)[, lapply(.SD, function(x) head(sort(x), 1)), Name]
# Name Height Weight Age
#1: Alice 180 70 35
#2: Bob NA 80 27
#3: Charles 170 75 NA
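Both sort()-based answers above return the smallest non-missing value per group, whereas the question's f returns the first one. The two coincide on the example data because each group has at most one non-missing value per column. A minimal base R sketch, using a hypothetical vector w that is not part of the original data, illustrates the difference and what happens for an all-missing group:

```r
# f from the question: first non-missing value in the original order
f <- function(x) {
  x <- x[!is.na(x)]
  x[1]
}

# Hypothetical group with two non-missing values, not in ascending order
w <- c(NA, 75, 70)

first_value    <- f(w)        # keeps the original order
smallest_value <- sort(w)[1]  # sort() drops NAs and orders ascending

# All-missing group: sort() returns a zero-length vector
all_na    <- c(NA_real_, NA_real_)
len_head  <- length(head(sort(all_na), 1))  # head() of nothing is still nothing
val_index <- sort(all_na)[1]                # indexing past the end gives NA

print(c(first = first_value, smallest = smallest_value))
```

Here f(w) is 75 while sort(w)[1] is 70, so the sort-based variants are only a drop-in replacement when "first" and "smallest" happen to agree.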
Simply add na.action=na.pass in the aggregate() call:
aggdf <- aggregate(.~Name, data=df, FUN=f, na.action=na.pass)
# Name Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
First add an ifelse() to your function so that it returns a value even when all values are NA:
f <- function(x) {
  x <- x[!is.na(x)]
  ifelse(length(x) == 0, NA, x)
}
You can then use dplyr to aggregate:
library(dplyr)
# f is passed directly; funs() is deprecated since dplyr 0.8.0
dt %>% group_by(Name) %>% summarise_all(f)
This returns:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <dbl> <dbl> <dbl>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
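The guarded function above works because ifelse() returns a result shaped like its test, which has length 1 here, so only the first element of x survives. A small base R sketch makes this visible on plain vectors; f_guarded is the answer's function under a new name, while f_explicit and the two test vectors are hypothetical additions for comparison:

```r
# Guarded version from the answer above
f_guarded <- function(x) {
  x <- x[!is.na(x)]
  ifelse(length(x) == 0, NA, x)
}

# Hypothetical explicit equivalent (not from the original answer)
f_explicit <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA else x[1]
}

some_values <- c(NA, 70, 65)         # two non-missing values
no_values   <- c(NA_real_, NA_real_) # nothing but NA

r1 <- f_guarded(some_values)   # first non-missing value
r2 <- f_guarded(no_values)     # NA, because nothing is left after filtering
r3 <- f_explicit(some_values)
r4 <- f_explicit(no_values)

print(list(r1, r2, r3, r4))
```

Either form returns exactly one value per group, which is what a summarising function is expected to do; the explicit if/else simply states the "first element" intent more directly.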