
Subsetting large data frames

Is there a fast and clever way to take, say, a data frame like this

vec <- data.frame(Names = c("var1","var2","var3","var4","var5","var6","var7",
                            "var8","var9","var10","var11","var12","var13",
                            "var14") ,
                  phase1= runif(14),
                  phase1.away= runif(14),
                  phase1_in= runif(14),
                  phase1_out= runif(14),
                  phase1.1= runif(14),
                  phase1.away.1= runif(14),
                  phase1_in.1= runif(14),
                  phase1_out.1= runif(14),
                  phase1.2= runif(14),
                  phase1.away.2= runif(14),
                  phase1_in.2= runif(14),
                  phase1_out.2= runif(14))

and produce a new data frame like this:

- always ordered by phase1.x, with the names of the variables corresponding to the values, plus the phase1_in and phase1_out values, but not phase1.away.

What I am currently doing is simply

vec.o<-vec[with(vec, order(-phase1)),]
d1<-vec.o[c("Names","phase1","phase1_in","phase1_out")]

vec.o<-vec[with(vec, order(-phase1.1)),]
d2<-vec.o[c("Names","phase1.1","phase1_in.1","phase1_out.1")]

cbind(d1,d2)

which is extremely tedious and, I am sure, not idiomatic R. Any clever ideas? I work with large data frames all the time, and R seems a bit cumbersome for this. Is there any good literature you would recommend for these purposes (loading many variables, creating names for them, operating on those variables, etc.)?

EDIT: corrected for the case where phase1.x goes up to 10 and higher.

I presume you have quite a lot more than phase1.1, phase1.2, so a general solution using regular expressions would be something along the lines of:

# Build an id vector for the phase1.x columns, and give Names the id -1.
# as.numeric() emits a warning because names without a numeric suffix become NA.
id <- as.numeric(gsub(".*\\.(\\d+$)","\\1",names(vec)))
id[1] <- -1
id[is.na(id)] <- 0 # first occurrence, no .x suffix


veclist <- lapply(unique(id)[-1],function(i){
    # select all necessary variables, excluding the "away" columns
    out <- vec[id %in% c(i,-1) & !grepl("away",names(vec))]
    # find the phase1.x column used for ordering
    ovec <- grepl("phase1(\\.\\d+)?$",names(out))
    # order descending, as in the question, and return the block
    out[order(-out[,ovec]),]
})

do.call(cbind,veclist)

The approach relies on recognizing the last number preceded by a dot and extracting it. If a name has no trailing number preceded by a dot, it is either the Names variable (which I mark with -1) or a column of the first phase (which I mark with 0).
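To make the id vector concrete, here is a minimal sketch applied to a hand-written vector of column names. The vector `nms` is an assumption for illustration (it mimics the question's naming pattern and adds a two-digit suffix), and `suppressWarnings()` is used only to silence the coercion warning mentioned above.

```r
# Column names following the question's pattern, including a two-digit suffix.
nms <- c("Names", "phase1", "phase1.away", "phase1_in", "phase1_out",
         "phase1.1", "phase1.away.1", "phase1_in.1", "phase1_out.1",
         "phase1.10", "phase1_in.10")

# Keep only the trailing ".<digits>"; names without such a suffix are left
# unchanged by gsub() and therefore become NA in as.numeric().
id <- suppressWarnings(as.numeric(gsub(".*\\.(\\d+$)", "\\1", nms)))
id[1] <- -1          # the Names column
id[is.na(id)] <- 0   # first phase block, no ".x" suffix

# Show each column name next to the id it was assigned.
print(setNames(id, nms))
```

Columns that share an id belong to the same block, so `nms[id == 10]` would pick out the two `.10` columns here.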

Now you have an id vector that makes it easy to select the variables that belong together, so you can loop over the unique values of id, skipping the first one (-1). Using regular expressions again, you can pick whatever variables you want for the construction of a new data frame. The do.call at the end combines all those data frames again.

By the way, ordering sub-data-frames is considerably faster than ordering the original data frame first and then selecting your variables. That is where the gain in nullglob's solution comes from.
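As a small sanity check of that point, the sketch below builds a toy data frame (the name `df`, the seed, and the sizes are assumptions for illustration) and compares the two strategies. It only checks that they return the same result; it does not measure timing.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(Names = paste0("var", 1:6),
                 phase1 = runif(6), phase1.away = runif(6),
                 phase1_in = runif(6), phase1_out = runif(6))
cols <- c("Names", "phase1", "phase1_in", "phase1_out")

# Strategy 1: reorder every column of the frame, then keep a few of them.
whole_first <- df[order(-df$phase1), ][cols]

# Strategy 2: compute the permutation once and reorder only the kept columns.
o <- order(-df$phase1)
subset_only <- df[o, cols]

# Compare the two results.
all.equal(whole_first, subset_only)
```

The second form avoids copying the columns that are thrown away afterwards, which is presumably why it is cheaper on wide frames.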

This is not particularly clever, but it is over twice as fast (according to my simple benchmark):

o1 <- order(-vec$phase1)
o2 <- order(-vec$phase1.1)
cbind(vec[o1,c("Names","phase1","phase1_in","phase1_out")],
      vec[o2,c("Names","phase1.1","phase1_in.1","phase1_out.1")])
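If there are many phase1.x blocks, the same idea can be wrapped in a loop over suffixes. This is only a sketch: `ordered_block`, `demo`, and `suffixes` are hypothetical names introduced here, and it assumes every block follows the `phase1`, `phase1_in`, `phase1_out` naming pattern from the question.

```r
# Hypothetical helper: build one ordered block for a suffix ("", ".1", ".2", ...).
ordered_block <- function(df, suffix) {
  key  <- paste0("phase1", suffix)                       # ordering column
  cols <- c("Names", key,
            paste0(c("phase1_in", "phase1_out"), suffix)) # columns to keep
  df[order(-df[[key]]), cols]                            # descending by key
}

set.seed(42)  # arbitrary seed for a reproducible toy example
demo <- data.frame(Names = paste0("var", 1:5),
                   phase1 = runif(5), phase1_in = runif(5), phase1_out = runif(5),
                   phase1.1 = runif(5), phase1_in.1 = runif(5), phase1_out.1 = runif(5))

# One block per suffix, glued together side by side.
suffixes <- c("", ".1")
res <- do.call(cbind, lapply(suffixes, function(s) ordered_block(demo, s)))
res
```

Each block keeps its own Names column, so the result has one Names column per phase, just like the `cbind(d1,d2)` in the question.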

The benchmark is here:

> n <- 2e5
> vec<-data.frame(Names = as.character(runif(n)) ,
+                  phase1= runif(n),
+                  phase1.away= runif(n),
+                  phase1_in= runif(n),
+                  phase1_out= runif(n),
+                  phase1.1= runif(n),
+                  phase1.away.1= runif(n),
+                  phase1_in.1= runif(n),
+                  phase1_out.1= runif(n),
+                  phase1.2= runif(n),
+                  phase1.away.2= runif(n),
+                  phase1_in.2= runif(n),
+                  phase1_out.2= runif(n))
>
>
> test1 <- function(){
+   vec.o<-vec[with(vec, order(-phase1)),]
+   d1<-vec.o[c("Names","phase1","phase1_in","phase1_out")]
+   vec.o<-vec[with(vec, order(-phase1.1)),]
+   d2<-vec.o[c("Names","phase1.1","phase1_in.1","phase1_out.1")]
+   d3 <- cbind(d1,d2)
+ }
> system.time(test1())
   user  system elapsed
  1.764   0.048   1.811
>
>
> test2 <- function(){
+   o1 <- order(-vec$phase1)
+   o2 <- order(-vec$phase1.1)
+   d4 <- cbind(vec[o1,c("Names","phase1","phase1_in","phase1_out")],
+               vec[o2,c("Names","phase1.1","phase1_in.1","phase1_out.1")])
+ }
> system.time(test2())
   user  system elapsed
  0.736   0.056   0.791

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
