I have a data frame, df:
head(df)
Year Asset1 Asset2 Asset3 Asset4 Asset5
1 1857 1729900 32570 288482 1251642 0 0 67374 89832
2 1858 1870213 35255 312262 1354817 0 0 71948 95931
3 1859 1937622 36418 322562 1399505 0 0 76773 102364
4 1860 1969257 207557 83393 1484403 0 0 83102 110802
5 1861 2107481 222969 89585 1594627 0 0 85843 114457
6 1862 2306227 235498 94619 1684234 0 0 80613 211263
I use ddply to construct a new data frame in which Asset2 through Asset5 are divided by Asset1:
dft<-ddply(df,.(Year),transform,
Asset2=Asset2/Asset1,
Asset3=Asset3/Asset1,
Asset4=Asset4/Asset1,
Asset5=Asset5/Asset1)
But that is quite a lot of work if there are many columns. Any suggestions?
Best regards!
This is sort of what sweep is for.
Read in a (modified) version of your data:
m <- read.table(text = " Year Asset1 Asset2 Asset3 Asset4 Asset5
1857 1729900 32570 288482 1251642 0
1858 1870213 35255 312262 1354817 0
1859 1937622 36418 322562 1399505 0
1860 1969257 207557 83393 1484403 0
1861 2107481 222969 89585 1594627 0
1862 2306227 235498 94619 1684234 0 ",header = TRUE,sep = "")
> m
Year Asset1 Asset2 Asset3 Asset4 Asset5
1 1857 1729900 32570 288482 1251642 0
2 1858 1870213 35255 312262 1354817 0
3 1859 1937622 36418 322562 1399505 0
4 1860 1969257 207557 83393 1484403 0
5 1861 2107481 222969 89585 1594627 0
6 1862 2306227 235498 94619 1684234 0
> m[,3:6] <- sweep(m[,3:6],1,m[,2],"/")
> m
Year Asset1 Asset2 Asset3 Asset4 Asset5
1 1857 1729900 0.01882768 0.16676224 0.7235343 0
2 1858 1870213 0.01885079 0.16696601 0.7244186 0
3 1859 1937622 0.01879520 0.16647313 0.7222797 0
4 1860 1969257 0.10539864 0.04234744 0.7537884 0
5 1861 2107481 0.10579882 0.04250809 0.7566507 0
6 1862 2306227 0.10211397 0.04102762 0.7302984 0
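To see what the 1 (the MARGIN argument) does in the sweep call above, here is a minimal sketch on a small made-up matrix. The names x, divisor and res are illustrative only and do not come from the question:

```r
# Toy data (hypothetical values, chosen so the results are easy to check)
x <- matrix(c(2, 4, 6, 8, 10, 12), nrow = 2)   # 2 rows, 3 columns
divisor <- c(2, 4)                             # one divisor per row

# MARGIN = 1: divisor[i] is applied to every element of row i,
# so row 1 becomes 1 3 5 and row 2 becomes 1 2 3
res <- sweep(x, 1, divisor, "/")
res
```

With MARGIN = 2 the vector would instead need one entry per column, which is not what is wanted when every asset column is divided by the same Asset1 column.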
OK, I have two lapply solutions. I benchmarked the solutions above, and the loop is actually faster than the vectorized solution. Why?
EDIT: See nograpes' answer.
The first lapply solution:
m[, 3:6] <- do.call(cbind, lapply(m[, 3:6], function(x) x/m[, 2]))
m
And lapply2:
lapply(3:6, function(i) {
m[, i] <<- m[, i]/m[, 2]
})
# Year Asset1 Asset2 Asset3 Asset4 Asset5
# 1 1857 1729900 0.01882768 0.16676224 0.7235343 0
# 2 1858 1870213 0.01885079 0.16696601 0.7244186 0
# 3 1859 1937622 0.01879520 0.16647313 0.7222797 0
# 4 1860 1969257 0.10539864 0.04234744 0.7537884 0
# 5 1861 2107481 0.10579882 0.04250809 0.7566507 0
# 6 1862 2306227 0.10211397 0.04102762 0.7302984 0
Benchmarking with microbenchmark on an i7 Windows machine, with 1000 replications.
The setup:
LAPPLY <- function() {
m[, 3:6] <- do.call(cbind, lapply(m[, 3:6], function(x) x/m[, 2]))
m
}
LOOP <- function() {
for(i in 3:ncol(m)) {
m[ ,i] <- m[ , i]/m[ ,2]
}
m
}
SWEEP <- function(){
m[,3:6] <- sweep(m[,3:6],1,m[,2],"/")
m
}
LAPPLY2 <- function() {
lapply(3:6, function(i) {
m[, i] <<- m[, i]/m[, 2]
})
m
}
VECTORIZED <- function(){
m[,3:6]<-m[,3:6] / m[,2]
m
}
VECTORIZED2 <- function(){
m[,3:6]<-unlist(m[,3:6])/m[,2]
m
}
library(microbenchmark)
microbenchmark(
SWEEP(),
LAPPLY(),
LOOP(),
VECTORIZED(),
VECTORIZED2(),
LAPPLY2(),
times=1000L)
Results:
Unit: microseconds
expr min lq median uq max
1 LAPPLY() 7483.059 7577.758 7649.3655 7839.9290 41808.754
2 LAPPLY2() 563.061 602.713 618.3405 661.9585 7535.308
3 LOOP() 540.669 581.254 594.7820 626.5050 35505.929
4 SWEEP() 2544.735 2602.581 2645.9650 2735.5320 8335.814
5 VECTORIZED() 2409.452 2454.235 2494.5870 2585.5535 37313.134
6 VECTORIZED2() 8952.055 9063.081 9153.8150 9352.3085 45742.247
EDIT: I do get a speed-up by passing indexes to lapply and assigning globally, which is what a loop is doing anyway (lapply is a wrapper for a loop, I believe).
NOTE: LAPPLY2 has to be benchmarked last because it makes global changes to m (and m has to be reset after running LAPPLY2). This is a demonstration of why global assignment can be dangerous.
Also, I repeated the data frame from the OP 100 times (nrow x 100) so that the benchmark is a better test of the solutions.
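The side effect of <<- mentioned in the note can be seen in a small sketch. The data frame m_demo and the two function names are hypothetical stand-ins, not part of the benchmark:

```r
# Small stand-in for m (hypothetical values)
m_demo <- data.frame(a = c(2, 4), b = c(10, 20))

# Local assignment: the function modifies its own copy and returns it
local_version <- function() {
  m_demo[, 2] <- m_demo[, 2] / m_demo[, 1]
  m_demo
}

# Global assignment: `<<-` overwrites m_demo in the enclosing environment
global_version <- function() {
  m_demo[, 2] <<- m_demo[, 2] / m_demo[, 1]
  m_demo
}

invisible(local_version())
after_local <- m_demo$b     # still c(10, 20)

invisible(global_version())
after_global <- m_demo$b    # now c(5, 5)
```

Every further call to global_version() divides the already divided column again, which is why the data has to be reset between runs.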
EDIT 37 partB: Here are my results without duplicating the data frame, as well as how I duplicate the data frame:
# Unit: microseconds
# expr min lq median uq max
# 1 LAPPLY() 428.710 451.5680 468.362 485.6220 1497.452
# 2 LAPPLY2() 331.212 355.9365 368.532 386.7260 1361.235
# 3 LOOP() 326.547 355.0040 369.465 383.9260 1361.235
# 4 SWEEP() 828.497 868.1490 890.541 924.5950 31512.726
# 5 VECTORIZED() 764.587 809.8370 828.497 859.9855 3042.486
# 6 VECTORIZED2() 374.596 394.6560 408.884 424.0460 1399.954
dfdup <- function(dataframe, repeats=10){
DF <- dataframe[rep(seq_len(nrow(dataframe)), repeats), ]
rownames(DF) <-NULL
DF
}
m <- dfdup(m, 100)
I think this is a nice, readable alternative:
df[,3:6]<-df[,3:6] / df[,2]
If you want to make it a little more readable, you could do
df[,paste0('Asset',2:5)]<-df[,paste0('Asset',2:5)] / df[,'Asset1']
I found that the above approaches are slow because the division gets dispatched to Ops.data.frame (I think), which is slow. To avoid this:
df[,3:6]<-unlist(df[,3:6])/df[,2]
But it only gets as fast as the loop and lapply versions.
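Why the unlist version gives the right answer may not be obvious, so here is a minimal sketch on made-up data. It relies on unlist flattening the columns one after another and on R recycling the shorter divisor; the names d, flat and ratios are illustrative only:

```r
# Toy data frame (hypothetical values)
d <- data.frame(base = c(2, 4), a = c(10, 20), b = c(30, 40))

flat <- unlist(d[, 2:3])     # column by column: 10, 20, 30, 40
ratios <- flat / d[, 1]      # divisor c(2, 4) is recycled: 5, 5, 15, 10

d[, 2:3] <- ratios           # values are filled back column by column
d
```

Because the divisor has exactly one entry per row, each recycled copy lines up with one full column of the flattened vector.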
This is not really what ddply is meant for, and you don't need it in this case. ddply is good for splitting a data frame into groups of rows, based on the value in one of the columns. Usually the column that you use to split the data frame (in this case, Year) would have multiple rows with the same value.
Here, you are just dividing one column by another. You can do this as follows:
df$Asset2 <- df$Asset2/df$Asset1 #more human-readable
or
df[ ,3] <- df[ ,3]/df[ ,2] #numbered columns are useful in loops
I suspect there's a vectorized way to do what you want, but unless speed is a major concern it is pretty simple to loop this calculation:
#[hide under desk to avoid vectorization police]
for(i in 3:ncol(df)) {
df[ ,i] <- df[ , i]/df[ ,2]
}
IMO you might want to rename your columns, or preserve the old ones and make new ones, in order to avoid getting confused about whether a column contains the ratio or the original value. If you want to make new columns, just use df[ ,ncol(df)+1] <- df[ , i]/df[ ,2] inside the loop.
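As a rough sketch of the "preserve the old columns" idea, the loop can write into named columns instead of numbered ones. The data values and the "_ratio" suffix are assumptions made for illustration:

```r
# Toy data with the same column naming as the question (values are made up)
d <- data.frame(Year = c(1857, 1858),
                Asset1 = c(100, 200),
                Asset2 = c(10, 50),
                Asset3 = c(25, 100))

asset_cols <- paste0("Asset", 2:3)            # columns to divide
ratio_cols <- paste0(asset_cols, "_ratio")    # hypothetical naming scheme

# Add one new ratio column per asset column; the originals stay untouched
for (i in seq_along(asset_cols)) {
  d[[ratio_cols[i]]] <- d[[asset_cols[i]]] / d[["Asset1"]]
}
d
```

Naming the new columns explicitly makes it obvious which columns hold ratios and which hold the original values.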