Loop over several columns with ddply

I have a df :

head(df) :

  Year              Asset1       Asset2        Asset3 Asset4    Asset5 
1 1857              1729900        32570       288482 1251642      0                     0     67374            89832
2 1858              1870213        35255       312262 1354817      0                     0     71948            95931
3 1859              1937622        36418       322562 1399505      0                     0     76773           102364
4 1860              1969257       207557        83393 1484403      0                     0     83102           110802
5 1861              2107481       222969        89585 1594627      0                     0     85843           114457
6 1862              2306227       235498        94619 1684234      0                     0     80613           211263

I use ddply to construct a new df where Asset 2:5 are divided by Asset1:

library(plyr)

dft <- ddply(df, .(Year), transform,
             Asset2 = Asset2/Asset1,
             Asset3 = Asset3/Asset1,
             Asset4 = Asset4/Asset1,
             Asset5 = Asset5/Asset1)

But that is quite a lot of work if there are many columns... Any suggestions?

Best Regards!

This is sort of what sweep is for:

Read in a (modified) version of your data:

> m <- read.table(text = " Year              Asset1       Asset2        Asset3 Asset4    Asset5 
+  1857              1729900        32570       288482 1251642      0                     
+  1858              1870213        35255       312262 1354817      0                     
+  1859              1937622        36418       322562 1399505      0                     
+  1860              1969257       207557        83393 1484403      0                     
+  1861              2107481       222969        89585 1594627      0            
+  1862              2306227       235498        94619 1684234      0   ",header = TRUE,sep = "")
> m
  Year  Asset1 Asset2 Asset3  Asset4 Asset5
1 1857 1729900  32570 288482 1251642      0
2 1858 1870213  35255 312262 1354817      0
3 1859 1937622  36418 322562 1399505      0
4 1860 1969257 207557  83393 1484403      0
5 1861 2107481 222969  89585 1594627      0
6 1862 2306227 235498  94619 1684234      0


> m[,3:6] <- sweep(m[,3:6],1,m[,2],"/")
> m
  Year  Asset1     Asset2     Asset3    Asset4 Asset5
1 1857 1729900 0.01882768 0.16676224 0.7235343      0
2 1858 1870213 0.01885079 0.16696601 0.7244186      0
3 1859 1937622 0.01879520 0.16647313 0.7222797      0
4 1860 1969257 0.10539864 0.04234744 0.7537884      0
5 1861 2107481 0.10579882 0.04250809 0.7566507      0
6 1862 2306227 0.10211397 0.04102762 0.7302984      0
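For anyone unfamiliar with sweep, here is a minimal sketch on a toy matrix (x and divisor are made-up names, not part of the question's data). With MARGIN = 1, the i-th element of STATS is paired with the i-th row, so each row is divided by its own value. Plain division recycles the vector down each column and gives the same result in this case.

```r
# Toy data, not from the question: a 2 x 3 matrix and one divisor per row.
x <- matrix(c(2, 4, 6, 8, 10, 12), nrow = 2)
divisor <- c(2, 4)

# MARGIN = 1 pairs divisor[i] with row i of x.
swept <- sweep(x, 1, divisor, "/")

# Plain division recycles divisor down each column of x.
plain <- x / divisor

all.equal(swept, plain)
```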

OK, I have two lapply solutions. I benchmarked the solutions above, and the loop is actually faster than the vectorized solution. Why?

EDIT: See nograpes' answer.

lapply solution:

m[, 3:6] <- do.call(cbind, lapply(m[, 3:6], function(x) x/m[, 2]))
m

And lapply2:

lapply(3:6, function(i) {
    m[, i] <<- m[, i]/m[, 2]
})

#   Year  Asset1     Asset2     Asset3    Asset4 Asset5
# 1 1857 1729900 0.01882768 0.16676224 0.7235343      0
# 2 1858 1870213 0.01885079 0.16696601 0.7244186      0
# 3 1859 1937622 0.01879520 0.16647313 0.7222797      0
# 4 1860 1969257 0.10539864 0.04234744 0.7537884      0
# 5 1861 2107481 0.10579882 0.04250809 0.7566507      0
# 6 1862 2306227 0.10211397 0.04102762 0.7302984      0

Benchmarking with microbenchmark on an i7 Windows machine with 1000 replications:

The setup:

LAPPLY <- function() {
    m[, 3:6] <- do.call(cbind, lapply(m[, 3:6], function(x) x/m[, 2]))
    m
}

LOOP <- function() {
    for(i in 3:ncol(m)) {
      m[ ,i] <- m[ , i]/m[ ,2]
    }
    m
}

SWEEP <- function(){
    m[,3:6] <- sweep(m[,3:6],1,m[,2],"/")
    m
}

LAPPLY2 <- function() {
    lapply(3:6, function(i) {
        m[, i] <<- m[, i]/m[, 2]
    })
        m
}

VECTORIZED <- function(){
    m[,3:6]<-m[,3:6] / m[,2]
    m
}

VECTORIZED2 <- function(){
    m[,3:6]<-unlist(m[,3:6])/m[,2]
    m
}

library(microbenchmark)

microbenchmark( 
    SWEEP(),
    LAPPLY(),
    LOOP(), 
    VECTORIZED(),
    VECTORIZED2(),
    LAPPLY2(),
    times=1000L)  

Results:

Unit: microseconds
           expr      min       lq    median        uq       max
1      LAPPLY() 7483.059 7577.758 7649.3655 7839.9290 41808.754
2     LAPPLY2()  563.061  602.713  618.3405  661.9585  7535.308
3        LOOP()  540.669  581.254  594.7820  626.5050 35505.929
4       SWEEP() 2544.735 2602.581 2645.9650 2735.5320  8335.814
5  VECTORIZED() 2409.452 2454.235 2494.5870 2585.5535 37313.134
6 VECTORIZED2() 8952.055 9063.081 9153.8150 9352.3085 45742.247

EDIT: I do get a speed-up by passing indexes to lapply and assigning globally, which is what a loop is doing anyway (lapply is a wrapper for a loop, I believe).

NOTE: LAPPLY2 has to be benchmarked last because it makes global changes to m (and m has to be reset after running LAPPLY2). This is a demonstration of why global assignment can be dangerous.
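A small sketch of that side effect, using a toy data frame d and a helper divide_globally (both hypothetical, not the benchmark objects): calling the function twice divides twice, because <<- changes the object outside the function and the first call's result is kept.

```r
# Toy data frame standing in for m; the names are illustrative only.
d <- data.frame(base = c(2, 4), value = c(8, 16))

divide_globally <- function() {
    # <<- modifies d in the enclosing environment, not a local copy.
    d[, 2] <<- d[, 2] / d[, 1]
    invisible(NULL)
}

divide_globally()
first <- d$value    # divided once
divide_globally()
second <- d$value   # divided again, starting from the first call's result
```

This is why m has to be rebuilt before each timing run of LAPPLY2 if the timings are meant to start from the same data.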

Also, I repeated the data frame from the OP 100 times (nrow x 100) to give a better simulation of the solutions.

EDIT 37 partB: Here are my results without duplicating the data frame, as well as how I duplicate the data frame:

# Unit: microseconds
#            expr     min       lq  median       uq       max
# 1      LAPPLY() 428.710 451.5680 468.362 485.6220  1497.452
# 2     LAPPLY2() 331.212 355.9365 368.532 386.7260  1361.235
# 3        LOOP() 326.547 355.0040 369.465 383.9260  1361.235
# 4       SWEEP() 828.497 868.1490 890.541 924.5950 31512.726
# 5  VECTORIZED() 764.587 809.8370 828.497 859.9855  3042.486
# 6 VECTORIZED2() 374.596 394.6560 408.884 424.0460  1399.954


dfdup <- function(dataframe, repeats=10){
    DF <- dataframe[rep(seq_len(nrow(dataframe)), repeats), ]
    rownames(DF) <-NULL
    DF
}

m <- dfdup(m, 100)

I think this is a nice, readable alternative:

df[,3:6]<-df[,3:6] / df[,2]

If you want to make it a little more readable, you could do

df[,paste0('Asset',2:5)]<-df[,paste0('Asset',2:5)] / df[,'Asset1']

I found that the versions above are slow because the division gets dispatched to Ops.data.frame (I think), and that is slow. To avoid this:

df[,3:6]<-unlist(df[,3:6])/df[,2]

But it only gets as fast as the other loop and lapply versions.
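To check that the two forms agree, here is a quick sketch on a few columns of the six rows from the question (a and b are illustrative names). unlist flattens the selected columns end to end and the divisor is recycled along the result, so each column ends up divided by Asset1 element by element.

```r
df <- data.frame(
    Year   = 1857:1862,
    Asset1 = c(1729900, 1870213, 1937622, 1969257, 2107481, 2306227),
    Asset2 = c(32570, 35255, 36418, 207557, 222969, 235498),
    Asset3 = c(288482, 312262, 322562, 83393, 89585, 94619)
)

# Data-frame division: goes through the data.frame method for "/".
a <- df
a[, 3:4] <- a[, 3:4] / a[, 2]

# Vector division on the flattened columns.
b <- df
b[, 3:4] <- unlist(b[, 3:4]) / b[, 2]

all.equal(a, b)
```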

This is not really what ddply is meant for, and you don't need it in this case. ddply is good for splitting a data frame into groups of rows, based on the value in one of the columns. Usually the column that you are using to split the data frame (in this case, Year) would have multiple rows with the same value.

Here, you are just dividing one column by another. You can do this as follows:

df$Asset2 <- df$Asset2/df$Asset1 #more human-readable

or

df[ ,3] <- df[ ,3]/df[ ,2] #numbered columns are useful in loops

I suspect there's a vectorized way to do what you want, but unless speed is a major concern it is pretty simple to loop this calculation:

#[hide under desk to avoid vectorization police]
for(i in 3:ncol(df)) {
  df[ ,i] <- df[ , i]/df[ ,2]
}

IMO you might want to rename your columns, or preserve the old ones and make new ones, in order to avoid getting confused about whether the column contains the ratio or the original value. If you want to make new columns, just use df[ ,ncol(df)+1] <- df[ , i]/df[ ,2]
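As one way to keep the originals, here is a sketch that writes each ratio to a new, named column. The "_ratio" suffix and the toy values are assumptions made for illustration, not something from the question.

```r
# Toy data with the same layout as the question (values are made up).
df <- data.frame(Year   = c(2001, 2002),
                 Asset1 = c(10, 20),
                 Asset2 = c(5, 10),
                 Asset3 = c(1, 4))

# Every column except Year and Asset1 gets a matching ratio column.
asset_cols <- setdiff(names(df), c("Year", "Asset1"))
for (col in asset_cols) {
    df[[paste0(col, "_ratio")]] <- df[[col]] / df$Asset1
}

names(df)
```

Selecting the columns by name rather than by position means the loop does not pick up the ratio columns it has just created.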
