How do I subset multiple data frames using multiple variables

Question

I want to subset five data frames based on a series of 31 variables. The data frames are stored in a list:

long_data_sets <- list(scale_g1, scale_g2, scale_g3, scale_g4, scale_g5)

Each of the five data frames includes the identical set of columns, among others, 31 factors called "speeder_225" through "speeder_375":

> str(scale_g1[53:83])
'data.frame':   5522 obs. of  31 variables:
$ speeder_225: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_230: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_235: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_240: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_245: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ... 
...

I want to subset the data frames based on one of the 31 factor variables at a time, so that I end up with 5*31 new data frames.

I created the function for subsetting that retains only two columns that I need going forward ("direction" and "response"):

create_speeder_data <- function(x, y){
  df <- subset(x, x[,y] == "Speeder",
              select = c("direction", "response"))
}

This allows me to create one new data frame at a time:

create_speeder_data(scale_g1, "speeder_225")

I tried to apply the function using map2() and the list of 5 data frames and a list of the 31 factor names, but this obviously did not work.

> speeder_var <- names(scale_g1[53:83])
> map2(long_data_sets, speeder_var, create_speeder_data)
Error: `.x` (5) and `.y` (31) are different lengths

The closest I could get was to take out the y argument from my function and apply the function to the list of five data frames for one of the 31 factors.

#Create subsetting function for "speeder_225"
create_speeder_225_data <- function(x){
  df <- subset(x, x$speeder_225 == "Speeder",
               select = c("direction", "response"))
}

#Map function to list of data frames
z_speeder_225 <- map(long_data_sets, create_speeder_225_data)

#Change names of new data frames in list
names(long_data_sets) <- c("g1", "g2", "g3", "g4", "g5")
names(z_speeder_225) <- paste0(names_long_data_sets, "speeder_225")

#Get data frames from list
list2env(z_speeder_225, envir=.GlobalEnv)

I would need to repeat this 30 more times to get to my 5*31 data frames. There must be an easier way to do that.

Any help is much appreciated!

Answer 1

I agree with @Mako212 - you may need to reconsider what you're trying to do. However, here's something that ought to work.

the below code will subset each dataset in the list. In the test data there are 5 categorical variables each with two levels. Since the otuput is based on 1 level only ( speeding ), the output will be 5 x 5 = 25 datasets. This is organized as a list of lists (5 x 5):

library(data.table)

# Creating some dummy data
k  <- 100
directions <- as.vector(sapply(c('North', 'West', 'South', 'East'), function (z) return(rep(z, k))))
speeding <- as.vector(sapply(c('speeding', 'not-speeding'), function (z) return(rep(z, k))))

# Test data - number_of_observations <= 4*k
createDataTable <- function(number_of_observations = 50){
  dt <- data.table(direction = sample(x = directions, size = number_of_observations, replace = T), 
                   speeder1 = sample(x = speeding, size = number_of_observations, replace = T), 
                   speeder2 = sample(x = speeding, size = number_of_observations, replace = T),
                   speeder3 = sample(x = speeding, size = number_of_observations, replace = T),
                   speeder4 = sample(x = speeding, size = number_of_observations, replace = T),
                   speeder5 = sample(x = speeding, size = number_of_observations, replace = T))
}

data_list <- lapply(X = floor(runif(n = 5, min = 50, max = 4*k)), 
                    FUN = function(z){createDataTable(z)})

# Subset dummy data based on one column at a time and return 
# the number of observations, direction, speeder2 and speeder3 from the subset 
cols <- sapply(1:5, function(z) paste('speeder',z,sep = ""))

ret <- lapply(cols, function(z){
  lapply(data_list, function(x){
    return(x[get(z) == 'speeding', .(nrows = .N, direction, speeder2, speeder3)])
  })
})

The structure of ret is consistent with what we expect. Each item is a list of 5 data.table objects that have 4 columns each.

> summary(ret)
     Length Class  Mode
[1,] 5      -none- list
[2,] 5      -none- list
[3,] 5      -none- list
[4,] 5      -none- list
[5,] 5      -none- list
> summary(ret[[1]])
     Length Class      Mode
[1,] 4      data.table list
[2,] 4      data.table list
[3,] 4      data.table list
[4,] 4      data.table list
[5,] 4      data.table list

A quick test to see if the code is working well and not subsetting incorrectly, here's a simplified call with only the number of observations/rows for the subset criteria:

> unlist(lapply(cols, function(z){
+     lapply(data_list, function(x){
+         return(x[get(z) == 'speeding', .(nrows = .N)])
+     })
+ }))
nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows 
  113    82    24   112   185    97    63    22   110   193   103    78    35   115   197   110    74 
nrows nrows nrows nrows nrows nrows nrows nrows 
   26   103   194   107    84    25    97   191

How do I subset multiple data frames using multiple variables

Question

1 answers

solution1
0 2018-07-30 15:48:06

How do I subset multiple data frames using multiple variables

Question

1 answers

solution1 0 2018-07-30 15:48:06

solution1
0 2018-07-30 15:48:06