简体   繁体   中英

Creating variable in R data frame depending on another data frame

I am seeking help after having wasted almost a day. I have a big data frame (bdf) and a small data frame (sdf). I want to add variable z to bdf depending on the value of sdf$y (which changes as a function of a time variable).

Here is a reproducible example:

bdf <- data.frame(tb = seq(as.POSIXct("2013-05-19 17:11:22 GMT", tz="GMT"), by=5624*24, length.out=10))

bdf
                tb
1  2013-05-19 17:11:22
2  2013-05-21 06:40:58
3  2013-05-22 20:10:34
4  2013-05-24 09:40:10
5  2013-05-25 23:09:46
6  2013-05-27 12:39:22
7  2013-05-29 02:08:58
8  2013-05-30 15:38:34
9  2013-06-01 05:08:10
10 2013-06-02 18:37:46


sdf <- data.frame(ts = as.POSIXct(c("2013-05-22", "2013-05-25", "2013-05-30"), tz="GMT"), y = c(0.2, -0.1, 0.3))

> sdf
      ts    y
1 2013-05-22  0.2
2 2013-05-25 -0.1
3 2013-05-30  0.3

I want to create variable z in bdf with the following values of sdf$y:

  • 0.2 for rows where bdf$tb ranges from the first bdf$tb value to mid-way between the 1st and 2nd value of sdf$ts. In this simple example, that is the case of rows 1 to 3 of dbf which have times bdf$tb below "2013-05-23 12:00:00 GMT".

  • -0.1 for rows where bdf$tb ranges from mid-way between the 1st and 2nd value of sdf$ts to mid-way between the 2nd and 3rd value of sdf$ts. In this simple example, that is the case of rows 4 and 5 of dbf which have times bdf$tb between "2013-05-23 12:00:00 GMT" and "2013-05-27 12:00:00 GMT".

  • 0.3 for all rows where bdf$tb ranges from mid-way between the 2nd and 3rd value of sdf$ts to the last value of bdf$tb. In this simple example, that is the case of rows 1 to 6 to 10 of dbf which have times larger than "2013-05-23 12:00:00 GMT".

Hence, in the end, the big dataframe bdf should look like this:

                 tb    z
1  2013-05-19 17:11:22  0.2
2  2013-05-21 06:40:58  0.2
3  2013-05-22 20:10:34  0.2
4  2013-05-24 09:40:10 -0.1
5  2013-05-25 23:09:46 -0.1
6  2013-05-27 12:39:22  0.3
7  2013-05-29 02:08:58  0.3
8  2013-05-30 15:38:34  0.3
9  2013-06-01 05:08:10  0.3
10 2013-06-02 18:37:46  0.3

I could not succeed using dplyr::mutate and got nowhere using loops... Any help would be much appreciated. I hope that I clearly described the issue as adhered to the etiquette (it is my first question).

Here's a solution using data.table 's rolling joins :

require(data.table)
setkey(setDT(sdf), ts)
sdf[bdf, roll = "nearest"]
#                      ts    y
#  1: 2013-05-19 17:11:22  0.2
#  2: 2013-05-21 06:40:58  0.2
#  3: 2013-05-22 20:10:34  0.2
#  4: 2013-05-24 09:40:10 -0.1
#  5: 2013-05-25 23:09:46 -0.1
#  6: 2013-05-27 12:39:22  0.3
#  7: 2013-05-29 02:08:58  0.3
#  8: 2013-05-30 15:38:34  0.3
#  9: 2013-06-01 05:08:10  0.3
# 10: 2013-06-02 18:37:46  0.3
  • setDT converts data.frame to data.table by reference .

  • setkey sorts the data.table by reference in increasing order by the columns provided, and marks those columns as key columns (so that we can join on those key columns later.

  • In data.table, x[i] performs a join when i is a data.table. I'll refer you to this answer to catch up on data.table joins, if you're not already familiar with.

  • x[i] performs an equi-join . That is, it finds matching row indices in x for every row in i and then extracts those rows from x to return the join result along with the corresponding row from i . In case a row in i doesn't find matching row indices in x , that row would have NA for x by default.

    However, x[i, roll = .] performs a rolling join . When there's no match, either the last observation is carried forward ( roll = TRUE or -Inf ), or the next observation can be carried backward ( roll = Inf ), or rolled to the nearest value ( roll = "nearest" ). And in this case you require roll = "nearest" IIUC.

HTH

Here's my approach:

library(zoo)
m <- c(rollmean(as.POSIXct(sdf$ts), 2), Inf)
transform(bdf, z = sdf$y[sapply(tb, function(x) which.max(x < m))])
#                    tb    z
#1  2013-05-19 17:11:22  0.2
#2  2013-05-21 06:40:58  0.2
#3  2013-05-22 20:10:34  0.2
#4  2013-05-24 09:40:10 -0.1
#5  2013-05-25 23:09:46 -0.1
#6  2013-05-27 12:39:22  0.3
#7  2013-05-29 02:08:58  0.3
#8  2013-05-30 15:38:34  0.3
#9  2013-06-01 05:08:10  0.3
#10 2013-06-02 18:37:46  0.3

Update: removed conversion to numeric (not required)

Brief explanation:

  • as.POSIXct(sdf$ts) converts the dates to POSIXct-style date-times
  • rollmean(as.POSIXct(sdf$ts), 2) computes the rolling mean of each two consecutive rows. This happens to be exactly the time you want to use for separating the observations. rollmean is from package zoo . Computing a rollmean(..,2) means the output vector is shortened by 1 compared to the input vector.
  • That is why I wrap the result of rollmean in c(.., Inf) which means that the infinity value is added to the rollmean vector as the last value. This will ensure that the last entries of z in sdf are also returned (0.3 in the specific example).
  • I use transform to add the z column to bdf
  • sapply(tb, function(x) which.max(x < m)) loops through the entries in bdf$tb and for each entry, computes the maximum index for which bdf$tb is less (earlier) than m (which holds the vector of rollmean entries). Only the maximum (latest) index is returned for each bdf$tb entry.
  • That vector of indices is used in sdf$y[sapply(tb, function(x) which.max(x < m))] to extract the corresponding elements of sdf$y which will then be stored/copied to the new z column in bdf

Hope that helps

Edit note: I initially get a slightly different result than you did which I now think was related to my lack of understanding of R difftime objects. Timezones in POSIXt objects also remain a mystery to me but I now see that when I coerced a 'difftime' object to 'numeric' that I got the value in "days".

The findInterval function is very useful as an index creation function that maps a values-vector where one has multiple adjoining non overlapping intervals. You really only have two time-points that split into three intervals.

bdf$z <- c(0.2,-0.1,0.3)[findInterval(bdf$tb, 
                c(-Inf, 
  sdf$ts[2] - 0.5*as.numeric(difftime(sdf$ts[2], sdf$ts[1], units="secs")), 
  sdf$ts[3] - 0.5*as.numeric(difftime(sdf$ts[3], sdf$ts[2],units="sec")), 
                 Inf))]

> bdf
                    tb    z
1  2013-05-19 17:11:22  0.2
2  2013-05-21 06:40:58  0.2
3  2013-05-22 20:10:34  0.2
4  2013-05-24 09:40:10 -0.1
5  2013-05-25 23:09:46 -0.1
6  2013-05-27 12:39:22  0.3
7  2013-05-29 02:08:58  0.3
8  2013-05-30 15:38:34  0.3
9  2013-06-01 05:08:10  0.3
10 2013-06-02 18:37:46  0.3

I also checked to see if my result would be affected by whether the intervals in findIntervals were closed on their right rather than the left (default) and saw no difference.

This seems now absolutely unnecessary, but in base R

bdf$z <- numeric(nrow(bdf))
for(i in seq_along(bdf$z)){
  ind <- which.min(abs(bdf$tb[i] - sdf$ts))
  bdf$z[i] <- sdf$y[ind]
}

While being little clumsy, it has an advantage in clarity, which accomodates easy adaptation to dplyr

library(dplyr)
bdf %>% rowwise() %>% 
  mutate(z= sdf$y[which.min(abs(as.numeric(tb)-as.numeric(sdf$ts)))])

#Source: local data frame [10 x 2]
#Groups: <by row>

#                    tb    z
#1  2013-05-19 17:11:22  0.2
#2  2013-05-21 06:40:58  0.2
#3  2013-05-22 20:10:34  0.2
#4  2013-05-24 09:40:10 -0.1
#5  2013-05-25 23:09:46 -0.1
#6  2013-05-27 12:39:22  0.3
#7  2013-05-29 02:08:58  0.3
#8  2013-05-30 15:38:34  0.3
#9  2013-06-01 05:08:10  0.3
#10 2013-06-02 18:37:46  0.3

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM