Is it possible to aggregate with a custom function that uses two columns to return one column?
Say I have a dataframe:
x <- c(2,4,3,1,5,7)
y <- c(3,2,6,3,4,6)
group <- c("A","A","A","A","B","B")
data <- data.frame(group, x, y)
data
# group x y
# 1 A 2 3
# 2 A 4 2
# 3 A 3 6
# 4 A 1 3
# 5 B 5 4
# 6 B 7 6
And I have my function that I want to use on two columns (x and y):
pathlength <- function(xy) {
  out <- as.matrix(dist(xy))
  sum(out[row(out) - col(out) == 1])
}
I tried the following with aggregate:
out <- aggregate(cbind(x, y) ~ group, data, FUN = pathlength)
out <- aggregate(cbind(x, y) ~ group, data, function(x) pathlength(x))
However, this calls pathlength on x and y separately instead of together, giving me:
# group x y
#1 A 5 8
#2 B 2 2
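Those numbers are what you get when pathlength sees one column at a time: for a single vector, the "path" collapses to the sum of absolute differences between consecutive values. A minimal check for group A, using only the values from the data above (xA and yA are just illustrative names):

```r
pathlength <- function(xy) {
  out <- as.matrix(dist(xy))
  sum(out[row(out) - col(out) == 1])
}

# Group A, one column at a time (illustrative variable names)
xA <- c(2, 4, 3, 1)
yA <- c(3, 2, 6, 3)

pathlength(xA)       # matches the x value aggregate() reported for group A
# [1] 5
pathlength(yA)       # matches the y value aggregate() reported for group A
# [1] 8

# The same numbers without dist(): consecutive absolute differences
sum(abs(diff(xA)))
# [1] 5
sum(abs(diff(yA)))
# [1] 8
```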
What I want is it to call pathlength on x and y together and aggregate it this way. Here is what I want aggregate to do:
realA <- matrix(c(2,4,3,1,3,2,6,3), nrow=4, ncol=2)
pathlength(realA)
# [1] 9.964725
realB <- matrix(c(5,7,4,6), nrow=2, ncol=2)
pathlength(realB)
# [1] 2.828427
# Built inline so the earlier `group` vector and the `pathlength` function are not overwritten
real_out <- data.frame(group = c("A", "B"), pathlength = c(9.964725, 2.828427))
real_out
# group pathlength
# 1 A 9.964725
# 2 B 2.828427
Does anyone have any suggestions? Or is there some other function that I can't find on google that will let me do this? I'd rather not work around this using a for loop, as I'm assuming it will be slow for a big dataset.
As you've found out, the base aggregate() function applies FUN to one column at a time. Instead, you could use the by() function:
by(data[, c("x", "y")], data$group, pathlength)
# data$group: A
# [1] 9.964725
# -----------------------------------------------------------------------
# data$group: B
# [1] 2.828427
or split()/lapply():
lapply(split(data[, c("x", "y")], data$group), pathlength)
# $A
# [1] 9.964725
# $B
# [1] 2.828427
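Both of these return a list-like object rather than the data frame shown in the question. One way to get that shape, sketched here rather than offered as the only option, is to simplify the split() result with sapply() and rebuild the columns by hand. The second half shows a workaround that keeps aggregate(): aggregate over row indices and subset the data frame inside the function. The names res and agg are illustrative.

```r
x <- c(2, 4, 3, 1, 5, 7)
y <- c(3, 2, 6, 3, 4, 6)
group <- c("A", "A", "A", "A", "B", "B")
data <- data.frame(group, x, y)

pathlength <- function(xy) {
  out <- as.matrix(dist(xy))
  sum(out[row(out) - col(out) == 1])
}

# sapply() simplifies the per-group results to a named numeric vector
res <- sapply(split(data[, c("x", "y")], data$group), pathlength)
real_out <- data.frame(group = names(res), pathlength = unname(res))
real_out

# Workaround with aggregate(): group the row indices, then look up
# both columns for those rows inside the function
agg <- aggregate(seq_len(nrow(data)), by = list(group = data$group),
                 FUN = function(i) pathlength(data[i, c("x", "y")]))
names(agg)[2] <- "pathlength"
agg
```

The index trick relies on data being visible from inside the anonymous function, so it is less tidy than by() or split(), but it returns a data frame directly.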
As pointed out by @BrodieG, this is easily done with "data.table":
library(data.table)
as.data.table(data)[, pathlength(.SD), by = group]
# group V1
# 1: A 9.964725
# 2: B 2.828427
You can also consider building the matrix input "on-the-fly" in "data.table":
library(data.table)
as.data.table(data)[, pathlength(matrix(unlist(.SD), ncol = length(.SD))), by = group]
# group V1
# 1: A 9.964725
# 2: B 2.828427
As such, you can also consider making a helper function, like the following, that would create the matrix for you:
sdmat <- function(sd) matrix(unlist(sd), ncol = length(sd))
Then, you can do:
as.data.table(data)[, pathlength(sdmat(.SD)), by = group]
# group V1
# 1: A 9.964725
# 2: B 2.828427
Or even:
as.data.table(data)[, pathlength(sdmat(list(x, y))), by = group]
# group V1
# 1: A 9.964725
# 2: B 2.828427
Alternatively, you can try "dplyr":
library(dplyr)
data %>%
  group_by(group) %>%
  summarise(pathlength = pathlength(matrix(c(x, y), ncol = 2)))
# Source: local data frame [2 x 2]
#
# group pathlength
# 1 A 9.964725
# 2 B 2.828427
Alternatively, you can convert the data into a "long" format and then use your favorite aggregation function.
Here's a continuation with "dplyr":
library(dplyr)
library(tidyr)
data %>%
  gather(var, val, -group) %>%
  group_by(group) %>%
  summarise(pathlength = pathlength(matrix(val, ncol = length(unique(var)))))
# Source: local data frame [2 x 2]
#
# group pathlength
# 1 A 9.964725
# 2 B 2.828427
If anyone wants another easy solution, I ended up using ddply from the "plyr" package. It turns out that, unlike aggregate, ddply lets you apply a function to multiple columns at once.
Here's the code:
library(plyr)
out <- ddply(data, "group", summarise,
             pathlength = pathlength(cbind(x, y)))
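On the concern about big datasets: dist() computes every pairwise distance within a group, while pathlength only uses the distances between consecutive rows. A possible alternative, sketched under the assumption that each group's rows are already in path order, is to difference consecutive rows directly. pathlength2 is a hypothetical name, not something from the question or the answers above.

```r
x <- c(2, 4, 3, 1, 5, 7)
y <- c(3, 2, 6, 3, 4, 6)
group <- c("A", "A", "A", "A", "B", "B")
data <- data.frame(group, x, y)

# Hypothetical alternative to pathlength(): only consecutive steps are computed
pathlength2 <- function(xy) {
  d <- diff(as.matrix(xy))     # row-to-row differences, one row per step
  sum(sqrt(rowSums(d^2)))      # Euclidean length of each step, summed
}

res2 <- sapply(split(data[, c("x", "y")], data$group), pathlength2)
res2
```

It should drop into any of the grouped calls above in place of pathlength, since it accepts the same data frame or matrix input.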