简体   繁体   中英

R Compare Columns across Dataframes to Match Values

I have two dataframes, looking at houses (n=6) and certain dates (n=22).

  • ORIGINAL is the original dataset. It contains 38 observations on 5 variables. Not all houses have all the dates listed, and vice versa, leading to errors in calculations with different length variables.

  • SAMPLE is a new empty dataset. It contains 132 (6 x 22) observations on the same 5 variables. Now there is an observation for every household for every date.

    House Day Mongoose Fruit Elephant

      A 1 40 7 0.6 A 6 32 12 4.2 B 2 50 3 4.0 B 4 51 4 8.6 B 6 8 7 12.1 C 2 12 8 13.0 

I am trying to fill in the rest of SAMPLE by asking R to compare HouseID and Date between the two dataframes; if they match, the rest of the variables (mongoose, fruit, elephant) should be copied over for that observation.

I tried this to no avail...

 for(i in 1:nrow(original))
{
  if ((sample$Day == original$Day) && (sample$House == original$House)) 
  {
    sample$Mongoose[i]  <- original$Mongoose[i] 
    sample$Fruit[i]     <- original$Fruit[i] 
    sample$Elephant[i]  <- original$Elephant[i] 
  }
}

The following results:

I get the following 3 errors in sequence

  • In sample$Day == test$Day : longer object length is not a multiple of shorter object length

  • In is.na(e1) | is.na(e2) :longer object length is not a multiple of
    shorter object length

  • In ==.default (sample$House, test$House) :longer object length is
    not a multiple of shorter object length

The data DOES copy over, but incorrectly. All the values get transferred to the A house and sequential date, rather than the appropriate house and date.

Ie, it looks like this

House   Day   Mongoose   Fruit  Elephant 
 A       1      40        7      0.6
 A       2      50        3      4.0
 A       3      51        4      8.6
 A       4       8        7      12.1
 A       5      12        8      13.0
 A       6      32        12     4.2
 B       1 
 B       2 
 B       3 [...]  

When it should (in essence) look like this:

House   Day   Mongoose   Fruit  Elephant 
 A       1      40        7      0.6
 A       2    
 A       3     
 A       4      
 A       5      
 A       6      32        12     4.2    [rest of A houses have no data]
 B       1 
 B       2      50        3      4.0 
 B       3   
 B       4      51        4      8.6
 B       5 
 B       6       8        7      12.1    [rest of B houses have no data]
 C       1    
 C       2      12        8      13.0

Please advise; I will eventually have to extend this technique to look at a sample dataset with 198K entries, and a test dataset with 115K.

Thanks!

Sounds to me like this should work:

merge(sample, original, by = c("House", "Day"), all.x = TRUE)

But hard to tell without a reproducible example. You may also want to look into dplyr::left_join() . That is, assuming your data looks like the following:

sample <- data.frame(House = rep(c("A", "B", "C"), each = 6),
                     Day = rep(1:6, 3))

# head(sample)
#   House Day
# 1     A   1
# 2     A   2
# 3     A   3
# 4     A   4
# 5     A   5
# 6     A   6

original <- data.frame(House = c("A", "A", "B", "B", "C"),
                       Day = c(1, 6, 2, 4, 2),
                       Mongoose = c(40, 32, 50, 51, 8),
                       Fruit = c(7, 12, 3, 4, 8),
                       Elephant = c(0.6, 4.2, 4.0, 8.6, 12.1))

# head(original)
#   House Day Mongoose Fruit Elephant
# 1     A   1       40     7      0.6
# 2     A   6       32    12      4.2
# 3     B   2       50     3      4.0
# 4     B   4       51     4      8.6
# 5     C   2        8     8     12.1

We obtain:

# head(merge(sample, original, by = c("House", "Day"), all.x = TRUE))
#   House Day Mongoose Fruit Elephant
# 1     A   1       40     7      0.6
# 2     A   2       NA    NA       NA
# 3     A   3       NA    NA       NA
# 4     A   4       NA    NA       NA
# 5     A   5       NA    NA       NA
# 6     A   6       32    12      4.2

It could be a small tweak, look at this this line of your original code:

if ((sample$Day == original$Day) && (sample$House == original$House)) 

See if you can change it to this:

if ((sample$Day[i] == original$Day[i]) && (sample$House[i] == original$House[i]))

Because:

  • You are using a for loop with an i variable,
  • which you do very well with lines such as sample$Mongoose[i] <- original$Mongoose[i]
  • but in your example it seems the if statement is not actually making use of the i variable
  • so we revise it to make use of i so it will be comparing specifically that observation/rows's sample$Day with that observation/rows's original$Day , and the same for sample$House vs original$House

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM