I have a data frame as follows:
nearby_ids <- NULL
for (i in 1:10){
# pick a random number (1 to 9) of the other schools as "nearby"
others <- setdiff(1:10, i)
string <- paste(as.character(sample(others, sample(length(others), 1))), collapse = ",")
nearby_ids <- c(nearby_ids, string)}
my_df <- data.frame(school_id=1:10, classes=sample(1:50, 10), nearby_schools_id = nearby_ids, stringsAsFactors = FALSE)
This is how it looks:
The variables "school_id" and "classes" are integers, and nearby_schools_id is character.
What I want is the following (hopefully without going through loops):
For each row, I want to take the IDs in nearby_schools_id, use them as indices to subset the data frame, and sum "classes" over that subset.
The idea is, I want to know the total number of classes for all nearby schools.
Expectation: So for row 1 for example, I want to output 122 (= 46+8+44+24).
I know I need to use strsplit here, but I'm trying to avoid looping and apply()ing (I have some 3 million rows, and I want the most efficient way possible). As soon as I call strsplit(my_df$nearby_schools_id, ","), I get back a list of vectors, which makes things slightly more complicated.
Is there a vectorized solution for this? What is the best way to solve it?
Any help is appreciated.
Similar to @Ronak's logic, but the matching procedure can be done in bulk.
Updated to account for rows with an empty list of nearby schools (the factor levels keep one result per row, and such rows get NA):
spl <- strsplit(my_df$nearby_schools_id, ",", fixed=TRUE)
sa <- seq_along(spl)
my_df$result <- tapply(
my_df$classes[match(unlist(spl),my_df$school_id)],
factor(rep(sa, lengths(spl)), levels=sa),
FUN=sum
)
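To make the intermediate steps easier to follow, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration (they are not the question's random data), and school 4 deliberately has no nearby schools.

```r
# Hypothetical toy data: four schools; school 4 has an empty nearby list.
toy <- data.frame(
  school_id = 1:4,
  classes = c(10L, 20L, 30L, 40L),
  nearby_schools_id = c("2,3", "1", "1,2,4", ""),
  stringsAsFactors = FALSE
)

spl <- strsplit(toy$nearby_schools_id, ",", fixed = TRUE)
sa <- seq_along(spl)

# Flatten all nearby ids into one vector: "2" "3" "1" "1" "2" "4"
ids <- unlist(spl)

# Remember which row each flattened id came from: 1 1 2 3 3 3
grp <- rep(sa, lengths(spl))

# One bulk lookup of classes for every id: 20 30 10 10 20 40
vals <- toy$classes[match(ids, toy$school_id)]

# Sum per originating row; levels = sa keeps row 4 even though it has no ids
res <- tapply(vals, factor(grp, levels = sa), FUN = sum)
toy$result <- as.vector(res)
toy$result
# 50 10 70 NA
```

Row 4 comes back as NA rather than 0 because tapply() fills empty groups with NA. If 0 is preferred, replacing the NAs afterwards is one option.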
Testing on 3 million rows:
my_df <- my_df[rep(1:10,3e5),]
my_df$school_id <- 1:3e6
system.time({
spl <- strsplit(my_df$nearby_schools_id, ",", fixed=TRUE)
tapply(
my_df$classes[match(unlist(spl),my_df$school_id)],
rep(seq_along(spl), lengths(spl)),
FUN=sum
)
})
## user system elapsed
## 10.206 0.492 10.698
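Before trusting a timing, it may be worth checking on a smaller sample that the bulk version returns the same sums as a straightforward per-row loop. The sketch below does that; the sample size, the seed, and the names `df`, `bulk` and `loop` are arbitrary choices for illustration, not part of the original test.

```r
set.seed(1)  # arbitrary seed, only so the check is reproducible
n <- 1000L
ids <- seq_len(n)

# Each school gets between 1 and 5 randomly chosen other schools as "nearby"
nearby <- vapply(ids, function(i) {
  others <- ids[-i]
  paste(sample(others, sample(1:5, 1)), collapse = ",")
}, character(1))

df <- data.frame(
  school_id = ids,
  classes = sample(1:50, n, replace = TRUE),
  nearby_schools_id = nearby,
  stringsAsFactors = FALSE
)

spl <- strsplit(df$nearby_schools_id, ",", fixed = TRUE)
sa <- seq_along(spl)

# Bulk version: one match() call over all ids, then a grouped sum
bulk <- tapply(
  df$classes[match(unlist(spl), df$school_id)],
  factor(rep(sa, lengths(spl)), levels = sa),
  FUN = sum
)

# Per-row version: one match() call per row
loop <- sapply(spl, function(x)
  sum(df$classes[match(as.numeric(x), df$school_id)]))

# TRUE when both approaches give the same sums
agree <- isTRUE(all.equal(as.vector(bulk), loop))
agree
```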
I don't think you can actually do this without any kind of splitting. Try this approach:
my_df$result <- sapply(strsplit(my_df$nearby_schools_id, ','), function(x)
sum(my_df$classes[as.numeric(x)]))
If your data is not sorted by school ID, or the IDs do not form a continuous sequence starting at 1, you can use match to get the correct rows.
my_df$result <- sapply(strsplit(my_df$nearby_schools_id, ','), function(x)
sum(my_df$classes[match(as.numeric(x), my_df$school_id)]))
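As a small illustration of why match matters, here is a sketch with made-up data in which the school IDs are neither sorted nor contiguous. The data frame `toy` and the names `by_position` and `by_match` are hypothetical.

```r
# Hypothetical toy data: ids are unsorted and do not equal the row numbers.
toy <- data.frame(
  school_id = c(30L, 10L, 20L),
  classes = c(3L, 1L, 2L),
  nearby_schools_id = c("10,20", "30", "10,30"),
  stringsAsFactors = FALSE
)

spl <- strsplit(toy$nearby_schools_id, ",", fixed = TRUE)

# Positional indexing treats the ids as row numbers. Rows 10, 20 and 30
# do not exist in a 3-row data frame, so every lookup is NA.
by_position <- sapply(spl, function(x) sum(toy$classes[as.numeric(x)]))

# match() translates each id into the row where that id is actually stored.
by_match <- sapply(spl, function(x)
  sum(toy$classes[match(as.numeric(x), toy$school_id)]))

by_position
# NA NA NA
by_match
# 3 3 4
```

In a case where the IDs happen to fall within the range of row numbers but are shuffled, positional indexing would silently return wrong sums instead of NA, which is harder to notice.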