R: Loop through a set of values in one dataframe update a second dataframe

Question

Updated to a more realistic example; this time added duplicates in interp_b.

I am trying to populate a field in one dataframe ( interp_b ) using the values from a second dataframe ( bait ). I want to look at each row's obs_datetime in interp_b , and determine when that plot-station-year was last baited, prior to the obs_datetime . This will later be used to calculate a time-since-bait for each obs_datetime . Bait times are in the bait dataframe in column bait_datetime . The results should go in a field called latestbait_datetime in the interp_b dataframe.

I was visualizing an iterative process where interp_b "latestbait_datetime" keeps getting recalculated until the last row in the bait dataframe is reached. The for-loop I tried is clearly running through the rows and doing the specified calculations but I can't seem to get the output in the format I want; it is producing output for each loop rather than rewriting and updating the interp_b dataframe.

Here is some code to build the two dataframes; interp_b and bait (please excuse the inelegance)

# interp_b dataframe----

   structure(list(plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018", 
"RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600, 
1544954400, 1541084400, 1515160800, 1567756800), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct", 
"POSIXt"))), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))

# bait dataframe----

    structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", 
"RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400, 
1559746800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -4L), spec = structure(list(
    cols = list(plot_station_year = structure(list(), class = c("collector_character", 
    "collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

and the desired result would look like this

Below are two of my attempts. The first resulted in a dataframe that only contained the final run of the loop and the second attempt resulted in a dataframe containing all of the run results (as expected with the bind).

library(tidyverse)

#attempt #1----
    for (i in 1:nrow(bait)) { 

  print(paste("row =",i))

  interpbait <- interp_b %>% 
    mutate(latestbait_datetime = if_else((plot_station_year == bait$plot_station_year[i] & (obs_datetime >= bait$bait_datetime[i] & (is.na(latestbait_datetime) | latestbait_datetime < bait$bait_datetime[i]))), bait$bait_datetime[i], latestbait_datetime))

}


#attempt #2----
    resultb <- data.frame()

for (i in 1:nrow(bait)) { 

  print(paste("row =",i))

  interpbait2 <- interp_b %>% 
    mutate(latestbait_datetime = if_else((plot_station_year == bait$plot_station_year[i] & (obs_datetime >= bait$bait_datetime[i] & (is.na(latestbait_datetime) | latestbait_datetime < bait$bait_datetime[i]))), bait$bait_datetime[i], latestbait_datetime))

  resultb <- bind_rows(resultb, interpbait2)

  print(resultb)
}

Any help would be greatly appreciated.

Answer 1

I'm not sure how long this will take, but here is a tidyverse solution. For each row in interp_b , we filter the bait dataframe to the correct plot_station_year , and ensure that all date-times are less than the row in interp_b . Then, we arrange the filtered bait data by descending datetime (so that the most recent dates are on top). We slice the first row of that dataframe so that we only get the most recent date. Then we "pull out" the date-time from the dataframe, and add it onto the appropriate row in interp_b .

library(tidyverse)
library(progress) # for progress bar

# create progress bar to update, so that you can estimate the amount of time it will take to finish the entire loop
pb <- progress_bar$new(total = nrow(interp_b))

for (i in 1:nrow(interp_b)) {

  last_time_baited <- bait %>% 
    #filter bait dataframe to appropriate plot, station, year based on
    # the row in interp_b
    filter(plot_station_year == interp_b$plot_station_year[i],
           # ensure all datetimes are less than that row in interp_b
           bait_datetime < interp_b$obs_datetime[i]) %>% 
    # arrange by datetime (most recent datetimes first)
    arrange(desc(bait_datetime)) %>% 
    # take the top row - this will be the most recent date-time that
    # the plot-station was baited
    slice(1) %>% 
    # "pull" that value out of the dataframe so you have a value, 
    # not a tibble
    pull(bait_datetime) # 

  # update the row in interp_b with the date_time baited
  interp_b$latestbait_datetime[i] <- last_time_baited

  pb$tick() # print progress
}

The resulting table matches your expected output ( interp_b ):

# A tibble: 5 x 3
  plot_station_year    obs_datetime        latestbait_datetime
  <chr>                <dttm>              <dttm>             
1 Cow_C2_2019          2019-06-02 15:00:00 2019-05-10 11:00:00
2 RidingStable_C3_2018 2018-12-16 10:00:00 2018-12-01 10:00:00
3 RidingStable_C3_2018 2018-11-01 15:00:00 NA                 
4 Raf_C1_2018          2018-01-05 14:00:00 2017-04-04 11:00:00
5 Metcalfe_C2_2019     2019-09-06 08:00:00 NA

Answer 2

You could perform an outer join with data.table , and then select the highest bait_datetime for each plot_station_year.

Edit : I edited my answer to reflect the possibility that there could be multiple obs_datetime for a given unique plot_station_year in interp2 . To preserve these, we index them and include the index in the filtering step.

One potential improvement with large files (not tested) could be to merge using roll , instead of performing an outer merge and then to filter.

That version is shown in the end of the reproducible example:

library(data.table)

interp2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "RidingStable_C3_2018", 
    "Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600, 1559487300,
        1544954400, 1515160800, 1567756800), class = c("POSIXct", "POSIXt"
        ), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_, 
            NA_real_, NA_real_, NA_real_), class = c("POSIXct", "POSIXt"))), class = c("spec_tbl_df", 
                "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))

bait2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",  "Cow_C2_2019",
    "RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400, 
        1496674800, 1576674800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
        ), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl", 
            "data.frame"), row.names = c(NA, -5L), spec = structure(list(
                cols = list(plot_station_year = structure(list(), class = c("collector_character", 
                    "collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime", 
                        "collector"))), default = structure(list(), class = c("collector_guess", 
                            "collector")), skip = 1), class = "col_spec"))


# add index idx by plot_station_year, remove empty column, set keys
setDT(interp2)[, "latestbait_datetime" := NULL][, idx := 1:.N, by=plot_station_year]
setkeyv(interp2, c("plot_station_year", "idx", "obs_datetime"))

# same for bait2: set as data.table, set keys
setDT(bait2, key=c("plot_station_year", "bait_datetime"))

## option 1: merge files, then filter
# outer join on interp2 and bait2 on first column (and order by bait_datetime)
expected_out <- merge(interp2, bait2, by="plot_station_year", all=TRUE)

# set keys for sorting
setkey(expected_out, plot_station_year, idx, bait_datetime)

# select highest bait_datetime below obs_datetime by plot_station_year and idx
expected_out <- expected_out[is.na(bait_datetime) | bait_datetime < obs_datetime][,
    tail(.SD, 1), by=.(plot_station_year, idx)]

# rename and sort columns
setnames(expected_out, old="bait_datetime", new="latestbait_datetime")
setorder(expected_out, -latestbait_datetime, idx, na.last = TRUE)[]
#>       plot_station_year idx        obs_datetime latestbait_datetime
#> 1:          Cow_C2_2019   1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2:          Cow_C2_2019   2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018   1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4:          Raf_C1_2018   1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019   1 2019-09-06 08:00:00                <NA>


## option 2 (might use less memory): rolling join

bait2[, latestbait_datetime := bait_datetime]
out_alt <- bait2[interp2, .(plot_station_year, obs_datetime, idx, latestbait_datetime), 
    on=c("plot_station_year", "bait_datetime==obs_datetime"), roll=Inf]

# order
setorder(out_alt, -latestbait_datetime, idx, na.last = TRUE)[]
#>       plot_station_year        obs_datetime idx latestbait_datetime
#> 1:          Cow_C2_2019 2019-06-02 15:00:00   1 2019-05-10 15:00:00
#> 2:          Cow_C2_2019 2019-06-02 14:55:00   2 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 2018-12-16 10:00:00   1 2018-12-01 15:00:00
#> 4:          Raf_C1_2018 2018-01-05 14:00:00   1 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019 2019-09-06 08:00:00   1                <NA>
setcolorder(out_alt, c(1,3,2,4))[]
#>       plot_station_year idx        obs_datetime latestbait_datetime
#> 1:          Cow_C2_2019   1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2:          Cow_C2_2019   2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018   1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4:          Raf_C1_2018   1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019   1 2019-09-06 08:00:00                <NA>

## test that both options give the same result:

identical(expected_out, out_alt)
#> [1] TRUE

R: Loop through a set of values in one dataframe update a second dataframe

Question

2 answers

solution1
1 ACCPTED 2020-03-31 17:05:51

solution2
0 2020-02-27 22:04:41

R: Loop through a set of values in one dataframe update a second dataframe

Question

2 answers

solution1 1 ACCPTED 2020-03-31 17:05:51

solution2 0 2020-02-27 22:04:41

solution1
1 ACCPTED 2020-03-31 17:05:51

solution2
0 2020-02-27 22:04:41