简体   繁体   中英

Loop over multiple columns in a merge in R

I am trying to loop the merging of two dataframes over multiple columns, but I'm having trouble with the code and haven't been able to find any answers on SO. Here are some example data frames:

box <- c(5,7,2)
year <- c(1999,1999,1999)
rep5 <- c(5,5,5)
rep7 <- c(7,7,7)
rep2 <- c(2,2,2)
df1 <- data.frame(box,year,rep5,rep7,rep2)

box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)

df1
  box year rep5 rep7 rep2
1   5 1999    5    7    2
2   7 1999    5    7    2
3   2 1999    5    7    2

df2
   box1 box2 year2 distance
1     5    5  1999        0
2     5    7  1999      100
3     5    2  1999      200
4     5    5  2000        0
5     7    5  1999      100
6     7    7  1999        0
7     7    2  1999      300
8     7    4  1999      200
9     2    5  1999      200
10    2    7  1999      300
11    2    2  1999        0
12    2    9  1999      300

What I am trying to do is get the distance information from df2 into df1, with df1 year matched to df2 year, df1 box matched to df2 box1, and df1 rep[i] matched to df2 box2. I can do this for a single df1 rep[i] column as follows:

merge(df1, df2, by.x=c("box", "rep5", "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)

this gives the desired output:

  box rep5 year rep7 rep2 distance
1   2    5 1999    7    2      200
2   5    5 1999    7    2        0
3   7    5 1999    7    2      100

However, in order to save doing this for each rep[i] column individually (I have a lot of these columns in the real data set), I'd like to be able to loop over those columns. Here is the code I have tried to do that:

reps <- c(df1$rep7, df1$rep2)
df3 <- for (i in reps) {merge(df1, df2, by.x=c("box", i, "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)}
df3

When I run that code, I get the error "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column." I also tried defining

reps <- c("rep7", "rep2")

When I run the same code using that definition, I get the result that df3 is NULL.

The output that I want (with the distance column renamed for clarity) is:

  box year rep5 rep7 rep2 dist5 dist7 dist2
1   2 1999    5    7    2   200   300     0
2   5 1999    5    7    2     0   100   200
3   7 1999    5    7    2   100     0   300

What am I doing wrong? Any help you can give me would be very much appreciated!

My R life became so much easier when I learned about the libraries dplyr and tidyr , and the concept of tidy data sets . What you're trying to do above can be expressed as a pivot , and is pretty easy to do with dplyr and tidyr .

I'm assuming what you really want, is to turn df2:

   box1 box2 year2 distance
1     5    5  1999        0
2     5    7  1999      100
3     5    2  1999      200
4     5    5  2000        0
5     7    5  1999      100
6     7    7  1999        0
7     7    2  1999      300
8     7    4  1999      200
9     2    5  1999      200
10    2    7  1999      300
11    2    2  1999        0
12    2    9  1999      300

into your output, with all those strange repetitions removed:

  box year dist5 dist7 dist2
1   2 1999   200   300     0
2   5 1999     0   100   200
3   7 1999   100     0   300

So you should pivot box2 into columns, with your distance as the value. using dplyr and tidyr :

library(tidyr)
box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)

# reshape it as desired
spread(df2, box2, distance,fill=0)
#Source: local data frame [4 x 7]

#  box1 year2   2   4   5   7   9
#1    2  1999   0   0 200 300 300
#2    5  1999 200   0   0 100   0
#3    5  2000   0   0   0   0   0
#4    7  1999 300 200 100   0   0

My recommendation: learn to use dplyr and tidyr . It makes life so, so much easier.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM