
R - How to subset and sum over a dataframe with list of vectors that contain ids?

I have a data frame as follows:

nearby_ids <- NULL

for (i in 1:10) {
  # pick a random number (1 to 9) of the other schools as neighbours of school i
  others <- setdiff(1:10, i)
  string <- paste(sample(others, sample(length(others), 1)), collapse = ",")
  nearby_ids <- c(nearby_ids, string)
}

my_df <- data.frame(school_id = 1:10, classes = sample(1:50, 10), nearby_schools_id = nearby_ids, stringsAsFactors = FALSE)

This is how it looks:

[Image: the resulting data frame]

The variables "school_id" and "classes" are integers, and nearby_schools_id is character.

What I want is the following (hopefully without going through loops):

For each row, I want to take the ids in nearby_schools_id, use them to subset the data frame, and sum "classes" over that subset.

The idea is, I want to know the total number of classes for all nearby schools.

Expectation: So for row 1 for example, I want to output 122 (= 46+8+44+24).

I know I need strsplit here, but I'm trying to avoid loops and apply() (I have some 3 million rows, and I want the most efficient way possible). As soon as I call strsplit(my_df$nearby_schools_id, ","), I get back a list of vectors, which makes things more complicated.
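To make the "list of vectors" concrete, here is a minimal sketch on a made-up three-row column (the `nearby` vector is hypothetical, not the data above). It shows that the list can be flattened into two parallel vectors, the ids and the row each id came from, which is the shape most vectorized approaches need:

```r
# Hypothetical three-row example of the comma-separated id column
nearby <- c("2,3", "1", "1,2")

# One character vector per row
spl <- strsplit(nearby, ",", fixed = TRUE)

# Flatten: every id in one vector, plus the row it belongs to
ids <- as.integer(unlist(spl))
row <- rep(seq_along(spl), lengths(spl))

print(data.frame(row, ids))
```

Once the data is in this long form, the per-row sum becomes a grouped sum over `row`.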

Is there a vectorized solution for this? What is the best way to solve it?

Any help is appreciated

Similar to @Ronak's logic, but the matching can be done in bulk: flatten all the ids with unlist(), call match() once, and then sum within each original row.
Updated to handle rows with an empty list of nearby schools.

spl <- strsplit(my_df$nearby_schools_id, ",", fixed=TRUE)
sa <- seq_along(spl)
my_df$result <- tapply(
    my_df$classes[match(unlist(spl),my_df$school_id)],
    factor(rep(sa, lengths(spl)), levels=sa),
    FUN=sum
)
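The factor(..., levels=sa) part is what handles rows with no nearby schools. A small sketch on a hypothetical data frame (`toy` is made up for illustration), assuming an empty list is stored as an empty string:

```r
# Hypothetical data: school 3 has no nearby schools (empty string)
toy <- data.frame(school_id = 1:3,
                  classes = c(10, 20, 30),
                  nearby_schools_id = c("2,3", "1", ""),
                  stringsAsFactors = FALSE)

spl  <- strsplit(toy$nearby_schools_id, ",", fixed = TRUE)  # third element is character(0)
sa   <- seq_along(spl)
vals <- toy$classes[match(unlist(spl), toy$school_id)]
grp  <- rep(sa, lengths(spl))                               # row 3 never appears here

# Plain grouping silently drops row 3, so the result is too short to assign back
without_levels <- tapply(vals, grp, FUN = sum)

# Declaring all rows as factor levels keeps row 3 as an empty group (NA)
with_levels <- tapply(vals, factor(grp, levels = sa), FUN = sum)

print(without_levels)
print(with_levels)
```

If "no nearby schools" should count as zero classes rather than NA, the NA entries can be replaced afterwards, bearing in mind that an id with no match in school_id also produces NA.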

Testing on 3 million rows:

my_df <- my_df[rep(1:10,3e5),]
my_df$school_id <- 1:3e6

system.time({
spl <- strsplit(my_df$nearby_schools_id, ",", fixed=TRUE)
tapply(
    my_df$classes[match(unlist(spl),my_df$school_id)],
    rep(seq_along(spl), lengths(spl)),
    FUN=sum
)
})
##   user  system elapsed 
## 10.206   0.492  10.698
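If the grouped sum turns out to be the slow part, one variant that may be worth benchmarking is base R's rowsum() in place of tapply(). This is a swapped-in technique, not part of the answer above, and it has not been timed here. The sketch uses a hypothetical `toy` data frame:

```r
# Hypothetical small data frame; school 3 has no nearby schools
toy <- data.frame(school_id = 1:4,
                  classes = c(10, 20, 30, 40),
                  nearby_schools_id = c("2,3", "1,4", "", "1,2,3"),
                  stringsAsFactors = FALSE)

spl  <- strsplit(toy$nearby_schools_id, ",", fixed = TRUE)
vals <- toy$classes[match(unlist(spl), toy$school_id)]
grp  <- rep(seq_along(spl), lengths(spl))

# rowsum() returns a one-column matrix whose rownames are the group values
sums <- rowsum(vals, grp)

# Rows with no nearby schools are absent from 'sums', so start from NA
result <- rep(NA_real_, nrow(toy))
result[as.integer(rownames(sums))] <- sums[, 1]

print(result)
```

The rownames of `sums` are used to put each total back on its original row, because groups with no ids do not appear in the rowsum() output.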

I don't think you can actually do this without any kind of splitting. Try this approach:

my_df$result <- sapply(strsplit(my_df$nearby_schools_id, ','), function(x) 
                       sum(my_df$classes[as.numeric(x)]))

If your data is not sorted by school id, or if the ids are not a contiguous sequence starting at 1, indexing by the id itself picks the wrong rows. Use match to look up the correct row positions instead.

my_df$result <- sapply(strsplit(my_df$nearby_schools_id, ','), function(x)
                  sum(my_df$classes[match(as.numeric(x), my_df$school_id)]))
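A small illustration of why match matters, using a hypothetical `toy` data frame whose ids are neither sorted nor contiguous:

```r
# Hypothetical data with unsorted, non-contiguous school ids
toy <- data.frame(school_id = c(30, 10, 20),
                  classes = c(5, 7, 9),
                  nearby_schools_id = c("10,20", "30", "10"),
                  stringsAsFactors = FALSE)

# Nearby ids for the first row: 10 and 20
x <- as.numeric(strsplit(toy$nearby_schools_id[1], ",")[[1]])

# Direct indexing treats ids as row positions; positions 10 and 20 do not exist
direct <- toy$classes[x]

# match() translates ids into the rows where those ids actually live (rows 2 and 3)
matched <- toy$classes[match(x, toy$school_id)]

print(direct)
print(matched)
```

With direct indexing the sum for row 1 would be NA, while the match-based lookup picks up the classes of schools 10 and 20.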


 