简体   繁体   中英

Using apply for calculating subscale and total scores across multiple dataframes

I would like to set up two functions to automate the calculation of subscale and total scores across multiple dataframes which resemble datasets for various timepoints. I have considered various similar questions here, but haven't found a proper solution yet.

I managed to do the calculations manually, however, I am struggling in automating the calculation of subscale scores and total scores (from the subscale scores) for other timepoints available from other dfs using the apply function - I hope lapply is the correct one for this purpose.

Some random data to demonstrate the problem:

set.seed(1)
df1 <- data.frame(matrix(sample(32), ncol = 8))
names(df1) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")

set.seed(2)
df2 <- data.frame(matrix(sample(32), ncol = 8))
names(df2) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")

To account for potential NAs and the varying corresponding number of valid data, the manual calculation of the subscale and total scores looks as follows. For the calculation of the total score, I am also referring to rowSums, since in the real data, there are more than two subscales which constitute the total score and the subscale scores are next to each other within each df.

df1$sub1 <- rowSums(subset(df1, select=a1:a4), na.rm = TRUE) * ncol(subset(df1, select=a1:a4)) / 
rowSums(!is.na(subset(df1, select=a1:a4)))

df1$sub2 <- rowSums(subset(df1, select=b1:b4), na.rm = TRUE) * ncol(subset(df1, select=b1:b4)) / 
rowSums(!is.na(subset(df1, select=b1:b4)))

df1$total <- rowSums(subset(df1, select=sub1:sub2))

df1
df2

My idea on trying to iterate over multiple dataframes, was the following:

#Set up a list for the dfs 
dflist <- list(df1, df2)

#Define columns for subscale and total score calculation within each df 
subrange <- list(select(dflist, c(a1:a4, b1:b4)))
totalrange <- list(select(dflist, c(sub1, sub2)))

That's where the trouble starts – it returns a request asking for a selection

#Set up functions for the subscale scores and total scores 
subscalefun <- function() { 
rowSums(subset(dflist, select=subrange), na.rm = TRUE) * ncol(subset(dflist, select= subrange)) / 
rowSums(!is.na(subset(dflist, select= subrange)))
}

totalfun <- function() {
rowSums(subset(dflist, select=totalrange))
}

These functions are just thought as an approach to show what I try to accomplish. I am sure there should also be a paste argument included to be writing the results to the respective df.

#Using lapply for calculation of subscale and total scores across dfs defined in dflist
lapply (dflist, subscalefun)
lapply (dflist, totalfun)

Some assistance on how to approach this task would be highly appreciated. Maybe someone also can give a good advice on how to improve in functional programming (ie getting from simple functions frequently introduced in tutorials to programming more complex, custom functions and obtaining the proper "vocabulary" for that).

Translating code to functions is easier for me to start off with mirroring the original code. So the code you would start off with would be:

DF$sub1 <- rowSums(...)
DF$sub2 <- rowSums(...)
DF$total <- rowSums(...)

You were on the right track with the idea of lapply() . I'm going to use an anonymous function within lapply() :

lapply(dflist
       , function(DF) {
         DF$sub1 <- rowSums(subset(DF, select = a1:a4), na.rm = TRUE)
         DF$sub2 <- rowSums(subset(DF, select = b1:b4), na.rm = TRUE)
         DF$total <- rowSums(subset(DF, select=sub1:sub2))

         return(DF)
       } 
       )

[[1]]
  a1 a2 a3 a4 b1 b2 b3 b4 sub1 sub2 total
1  9  6 16 14 31 24 13 21   45   89   134
2 12 25  2  8 15  3 19 22   47   59   106
3 18 29  5 20 28  7  1 30   72   66   138
4 27 17  4 32 11 23 26 10   80   70   150

[[2]]
  a1 a2 a3 a4 b1 b2 b3 b4 sub1 sub2 total
1  6 27 12 16 20 30  3 14   61   67   128
2 22 26 13 28 19 29 17 25   89   90   179
3 18  4 23  8  7  9 31 24   53   71   124
4  5 21 32 15  1  2 10 11   73   24    97

This does not modify anything so you would have to do dflist <- lapply(dflist, ...) if you would want to keep it saved.

One thing that isn't great about this approach is that we'd have to copy and paste a1:a4 for however many letters there are in your dataset. Since the pattern is [letter][number] , we can look at the unique first characters in the dataset:

starting_letters <- unique(substring(names(df2), 1, 1))
starting_letters
[1] "a" "b"

And we can loop through the starting_letters vector to do get the subtotals with grep giving the column numbers that match the starting_letters :

lapply(starting_letters, function(nam) rowSums(df2[, grep(nam, names(df2))], na.rm = T))

[[1]]
[1] 61 89 53 73

[[2]]
[1] 67 90 71 24

We can also determine how many sub# there are going to be based on the length of the starting_letters vector:

subm_names <- paste0("sub", seq_len(length(starting_letters)))
subm_names
[1] "sub1" "sub2

And putting it all together:

lapply(dflist
       , function(DF) {
         start_letters <- unique(substring(names(DF), 1, 1))
         sub_names <- paste0("sub", seq_len(length(start_letters)))
         DF[sub_names] <- lapply(start_letters
                                 , function(let) {
                                   match_names <- grep(let, names(DF))
                                   rowSums(DF[, match_names], na.rm = T) / length(match_names) * rowSums(!is.na(DF[, match_names]))
                                 }
         )
         # DF[sub_names] <- lapply(start_letters
                                # , function(nam) rowSums(DF[, grep(nam, names(DF))], na.rm = T))
         DF$total <- rowSums(DF[sub_names])

         # DF$sub1 <- rowSums(subset(DF, select = a1:a4), na.rm = TRUE)
         # DF$sub2 <- rowSums(subset(DF, select = b1:b4), na.rm = TRUE)
         # DF$total <- rowSums(subset(DF, select=sub1:sub2))
         return(DF)
       } 
       )

The advantages of this approach is that it is more dynamic. If one data.frame in the list only as the a group, it wouldn't error out. Similarly, it will scale up to data.frame s with more letter groupings or number groupings.

Here is a solution using dplyr . This is a common problem in psych/health research. I would assume that each of your dataframes includes an ID variable (ie, each row is a unique case) and each dataframe represents a unique timepoint. This approach would work if you had more timpoints (ie, df3, df4) and more subscales (c, d, e), you would just need to adapt the code accordingly.

# generate sample data
df1 <- data.frame(matrix(sample(32), ncol = 8))
names(df1) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")

set.seed(2)
df2 <- data.frame(matrix(sample(32), ncol = 8))
names(df2) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")

# add id's and timepoint
df1 <- df1 %>% mutate(id=row_number(),time=1)
df2 <- df2 %>% mutate(id=row_number(),time=2)

# gather data, extract subscale name, calculate totals, join to original data
rbind(df1,df2) %>% gather(k,v,-id,-time) %>% 
  mutate(v=ifelse(v>28,NA,v)) %>% # add some NAs
  mutate(scale=sub('([a-z])[0-9]','\\1',k)) %>% 
  group_by(id,time,scale) %>% 
  summarise(sub.total=mean(v,na.rm=1)*n()) %>% 
  spread(scale,sub.total) %>% mutate(total=a+b) %>% 
  left_join(rbind(df1,df2),.) # original data will not show added NA's

  a1 a2 a3 a4 b1 b2 b3 b4 id time        a        b     total
1 10 27 29 24  4 19  6 18  1    1 81.33333 47.00000 128.33333
2 25  2 11 31  1  8 20 15  2    1 50.66667 44.00000  94.66667
3 13 14 22 28  5  7 17 12  3    1 77.00000 41.00000 118.00000
4 26 23 32 16 30  9  3 21  4    1 86.66667 44.00000 130.66667
5  6 27 12 16 20 30  3 14  1    2 61.00000 49.33333 110.33333
6 22 26 13 28 19 29 17 25  2    2 89.00000 81.33333 170.33333
7 18  4 23  8  7  9 31 24  3    2 53.00000 53.33333 106.33333
8  5 21 32 15  1  2 10 11  4    2 54.66667 24.00000  78.66667

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM