
Replace some column values in a data.frame based on another data.frame

I have two data.frames (df1 and df2). In columns P1-P10 of df2, I would like to replace each letter with the corresponding value of df1$V2, while keeping the first two columns of df2 unchanged.

df1 = data.frame(V1=LETTERS, V2=rnorm(26))

df2 <- data.frame(Name=sample(LETTERS, 6), bd=sample(1:6), P1=sample(LETTERS,6), P2=sample(LETTERS, 6), P3=sample(LETTERS, 6), P4=sample(LETTERS, 6), P5=sample(LETTERS, 6), P6=sample(LETTERS, 6), P7=sample(LETTERS, 6), P8=sample(LETTERS, 6), P9=sample(LETTERS, 6), P10=sample(LETTERS, 6))

My approach is the following:

df3 <- matrix(setNames(df1[,2], df1[,1])[as.character(unlist(df2[,3:12]))], nrow=6, ncol=10)
df4 <- data.frame(cbind(df2[,1:2], df3))

This gives me the desired output. However, my real data has 10,000 columns. Is there any way to avoid the cbind step or otherwise make the process faster?

> df4
Name bd         X1          X2         X3         X4         X5         X6        X7         X8         X9        X10
1    V  6 -1.8991102  0.40269050 -0.1517500 -2.5297829  1.5315622  1.4897071  1.364071 -1.2443708 -1.3197276 -0.4917057
2    T  1 -2.5297829 -0.44614123 -0.1894970 -0.6693774 -0.1517500 -1.0650962 -0.151750 -0.4461412 -0.6693774 -1.1351770
3    R  5 -0.6693774  0.09059365 -2.5297829  0.3233827 -0.9383348 -0.4461412  1.281797  1.5315622  1.4897071 -0.4461412
4    B  4 -0.4461412 -0.93833476 -1.2443708 -0.4461412 -0.1894970 -0.9383348 -1.135177 -1.8991102 -0.1894970  0.4026905
5    K  2 -1.0180271 -1.06509624 -0.1939600 -0.1894970  1.4897071 -0.6693774 -1.899110 -1.3197276  1.5315622 -0.1517500
6    Y  3  1.5315622 -0.19396005 -0.4917057 -0.4664239 -1.8991102  0.4026905 -1.065096 -0.9383348 -1.2443708 -0.4664239

Thanks

You can match the values of df2[3:12] against df1[[1]]. The resulting row numbers are then used to extract the corresponding values from the second column of df1.

df2[3:12] <- df1[match(as.character(unlist(df2[3:12])), 
                       as.character(df1[[1]])), 2]

The result (df2):

  Name bd         P1         P2         P3         P4         P5         P6         P7         P8         P9        P10
1    H  5  0.1199355  0.3752010 -0.3926061 -1.1039548 -0.1107821  0.9867373 -0.3360094 -0.7488000 -0.3926061  2.0667704
2    U  4  0.1168599  0.1168599  0.9867373  1.3521418  0.9867373 -0.3360094 -0.7724007 -0.3926061 -0.3360094 -1.2543480
3    R  3 -1.2337890 -0.1107821 -0.7724007  2.0667704  0.3752010  0.4645504  0.9867373  0.1168599 -0.0981773 -0.3926061
4    G  2 -0.3926061  0.3199261 -0.0981773 -0.1107821  2.0667704 -1.1039548 -1.2337890  0.3199261 -1.2337890 -2.1534678
5    C  6 -2.1534678 -1.1039548 -1.1039548 -0.7488000  0.4645504  0.3199261 -2.1534678 -0.3360094  0.9867373  0.8771467
6    I  1  0.6171634  0.6224091  1.8011711  0.7292998  0.8771467  2.0667704  0.3752010  0.4645504 -2.1534678 -0.7724007
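
To make the mechanics concrete, here is a minimal sketch on hypothetical toy data (keys, values and toy are illustrative names, not part of the question). It assumes, as the answer above does, that assigning a plain vector to several data frame columns fills them column by column, in the same order that unlist() reads them.

```r
# Hypothetical toy data, only to illustrate how match() drives the lookup
keys   <- c("A", "B", "C")   # plays the role of df1[[1]]
values <- c(10, 20, 30)      # plays the role of df1[[2]]

toy <- data.frame(id = 1:2,
                  P1 = c("A", "B"),
                  P2 = c("C", "A"),
                  stringsAsFactors = FALSE)

# unlist() flattens the letter columns column by column: "A" "B" "C" "A"
wanted <- as.character(unlist(toy[2:3]))

# match() returns the position of each letter in keys: 1 2 3 1
idx <- match(wanted, keys)

# Indexing values by those positions gives 10 20 30 10,
# which is written back column by column into P1 and P2
toy[2:3] <- values[idx]
toy
```

Any letter that is missing from keys would produce NA in idx and therefore NA in the result.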

If you don't want to replace the values inside df2, you can create a new data frame df4 with:

df4 <- "[<-"(df2, 3:12, value = df1[match(as.character(unlist(df2[3:12])), 
                                          as.character(df1[[1]])), 2])
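
The call above uses the replacement function "[<-" as an ordinary function, so it returns a modified copy instead of overwriting its first argument. A minimal sketch on a hypothetical toy data frame (toy and new_toy are illustrative names):

```r
# Hypothetical toy data frame
toy <- data.frame(id = 1:2, P1 = c("A", "B"), stringsAsFactors = FALSE)

# Functional form of `toy[2] <- c(1.5, 2.5)`: the result is returned,
# and toy itself is left as it was
new_toy <- "[<-"(toy, 2, value = c(1.5, 2.5))

toy$P1      # original letters are still there
new_toy$P1  # replaced values
```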

Try some *pply magic:

lookup<-tapply(df1$V2, df1$V1, unique) #Creates a lookup table
lookup.function<-function(x) as.numeric(lookup[as.character(x)]) #The function
df4<-data.frame(df2[,1:2], apply(df2[,3:12], 2,lookup.function )) #Builds the output
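
The lookup table is just a vector indexed by name. As a variation (not the code above), the sketch below builds the named vector with setNames(), as in the question, and uses lapply() so that each column is replaced in place without going through a matrix. The toy data frame and the seed are assumptions made only for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26), stringsAsFactors = FALSE)

# Hypothetical small stand-in for df2
toy <- data.frame(Name = c("x", "y"),
                  P1 = c("A", "Z"),
                  P2 = c("C", "A"),
                  stringsAsFactors = FALSE)

# Named numeric vector: lookup["A"] is the value of V2 in the row where V1 == "A"
lookup <- setNames(df1$V2, df1$V1)

# Replace each letter column by its looked-up values, one column at a time
toy[2:3] <- lapply(toy[2:3], function(col) unname(lookup[as.character(col)]))
toy
```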

Update:

The *pply approach is much faster than the match-based replacement from the other answer (called merge below), by a factor of roughly 5 to 16 in the timings reported below. Check this out:

num<-1000
df1 = data.frame(V1=LETTERS, V2=rnorm(26))
df2<-data.frame(cbind(first=1:num,second=1:num, matrix(sample(LETTERS, num^2, replace=TRUE), nrow=num, ncol=num)))


start<-Sys.time()
lookup<-tapply(df1$V2, df1$V1, unique)
lookup.function<-function(x) as.numeric(lookup[as.character(x)])
df4<-data.frame(cbind(df2[,1:2], data.frame(apply(df2[,3:(num+2)], 2, lookup.function ))))
(difftime(Sys.time(),start))


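start<-Sys.time()
# The letter columns are 3 through num+2, the same range as in the *pply version
df4.merge <- "[<-"(df2, 3:(num+2), value = df1[match(as.character(unlist(df2[3:(num+2)])), as.character(df1[[1]])), 2])
(difftime(Sys.time(),start))

# Both results should agree in every cell
all(df4==df4.merge)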

With 3000 rows and columns, the *pply combination needs 4.3 s whereas the match-based approach (merge) needs about 22 s on my slow Intel machine. It also scales nicely: with 4000 rows and columns, the respective times are 7.4 s and 118 s.
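
The same comparison can also be written with system.time(). This is only a sketch: the size is kept deliberately small, the seed is arbitrary, and res_apply, res_match and cols are illustrative names. It reports no timings of its own, since the figures depend on the machine and on num.

```r
num <- 200  # deliberately small; increase to see how the two approaches scale
set.seed(42)  # arbitrary seed
df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26), stringsAsFactors = FALSE)
letter_mat <- matrix(sample(LETTERS, num^2, replace = TRUE), nrow = num, ncol = num)
df2 <- data.frame(first = 1:num, second = 1:num, letter_mat, stringsAsFactors = FALSE)
cols <- 3:(num + 2)  # positions of the letter columns

# tapply + apply lookup
t_apply <- system.time({
  lookup <- tapply(df1$V2, df1$V1, unique)
  res_apply <- data.frame(df2[, 1:2],
                          apply(df2[, cols], 2,
                                function(x) as.numeric(lookup[as.character(x)])))
})

# match-based replacement
t_match <- system.time({
  res_match <- "[<-"(df2, cols,
                     value = df1[match(as.character(unlist(df2[cols])),
                                       as.character(df1[[1]])), 2])
})

# Both approaches should produce the same numbers
all(res_apply[cols] == res_match[cols])

# Elapsed seconds for each approach
c(apply = t_apply[["elapsed"]], match = t_match[["elapsed"]])
```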
