I want to subset five data frames based on a series of 31 variables. The data frames are stored in a list:
long_data_sets <- list(scale_g1, scale_g2, scale_g3, scale_g4, scale_g5)
Each of the five data frames includes the identical set of columns, among others, 31 factors called "speeder_225" through "speeder_375":
> str(scale_g1[53:83])
'data.frame': 5522 obs. of 31 variables:
$ speeder_225: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_230: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_235: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_240: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speeder_245: Factor w/ 2 levels "Non-Speeder",..: 1 1 1 1 1 1 1 1 1 1 ...
...
I want to subset the data frames based on one of the 31 factor variables at a time, so that I end up with 5*31 new data frames.
I created the function for subsetting that retains only two columns that I need going forward ("direction" and "response"):
create_speeder_data <- function(x, y){
df <- subset(x, x[,y] == "Speeder",
select = c("direction", "response"))
}
This allows me to create one new data frame at a time:
create_speeder_data(scale_g1, "speeder_225")
I tried to apply the function using map2() and the list of 5 data frames and a list of the 31 factor names, but this obviously did not work.
> speeder_var <- names(scale_g1[53:83])
> map2(long_data_sets, speeder_var, create_speeder_data)
Error: `.x` (5) and `.y` (31) are different lengths
The closest I could get was to take out the y argument from my function and apply the function to the list of five data frames for one of the 31 factors.
#Create subsetting function for "speeder_225"
create_speeder_225_data <- function(x){
df <- subset(x, x$speeder_225 == "Speeder",
select = c("direction", "response"))
}
#Map function to list of data frames
z_speeder_225 <- map(long_data_sets, create_speeder_225_data)
#Change names of new data frames in list
names(long_data_sets) <- c("g1", "g2", "g3", "g4", "g5")
names(z_speeder_225) <- paste0(names_long_data_sets, "speeder_225")
#Get data frames from list
list2env(z_speeder_225, envir=.GlobalEnv)
I would need to repeat this 30 more times to get to my 5*31 data frames. There must be an easier way to do that.
Any help is much appreciated!
I agree with @Mako212 - you may need to reconsider what you're trying to do. However, here's something that ought to work.
the below code will subset each dataset in the list. In the test data there are 5 categorical variables each with two levels. Since the otuput is based on 1 level only ( speeding
), the output will be 5 x 5 = 25 datasets. This is organized as a list of lists (5 x 5):
library(data.table)
# Creating some dummy data
k <- 100
directions <- as.vector(sapply(c('North', 'West', 'South', 'East'), function (z) return(rep(z, k))))
speeding <- as.vector(sapply(c('speeding', 'not-speeding'), function (z) return(rep(z, k))))
# Test data - number_of_observations <= 4*k
createDataTable <- function(number_of_observations = 50){
dt <- data.table(direction = sample(x = directions, size = number_of_observations, replace = T),
speeder1 = sample(x = speeding, size = number_of_observations, replace = T),
speeder2 = sample(x = speeding, size = number_of_observations, replace = T),
speeder3 = sample(x = speeding, size = number_of_observations, replace = T),
speeder4 = sample(x = speeding, size = number_of_observations, replace = T),
speeder5 = sample(x = speeding, size = number_of_observations, replace = T))
}
data_list <- lapply(X = floor(runif(n = 5, min = 50, max = 4*k)),
FUN = function(z){createDataTable(z)})
# Subset dummy data based on one column at a time and return
# the number of observations, direction, speeder2 and speeder3 from the subset
cols <- sapply(1:5, function(z) paste('speeder',z,sep = ""))
ret <- lapply(cols, function(z){
lapply(data_list, function(x){
return(x[get(z) == 'speeding', .(nrows = .N, direction, speeder2, speeder3)])
})
})
The structure of ret
is consistent with what we expect. Each item is a list of 5 data.table
objects that have 4 columns each.
> summary(ret)
Length Class Mode
[1,] 5 -none- list
[2,] 5 -none- list
[3,] 5 -none- list
[4,] 5 -none- list
[5,] 5 -none- list
> summary(ret[[1]])
Length Class Mode
[1,] 4 data.table list
[2,] 4 data.table list
[3,] 4 data.table list
[4,] 4 data.table list
[5,] 4 data.table list
A quick test to see if the code is working well and not subsetting incorrectly, here's a simplified call with only the number of observations/rows for the subset criteria:
> unlist(lapply(cols, function(z){
+ lapply(data_list, function(x){
+ return(x[get(z) == 'speeding', .(nrows = .N)])
+ })
+ }))
nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows nrows
113 82 24 112 185 97 63 22 110 193 103 78 35 115 197 110 74
nrows nrows nrows nrows nrows nrows nrows nrows
26 103 194 107 84 25 97 191
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.