
Create New Column Based on Previous Row and Multiple Conditions in R

I have the following sample data frame:

date          product   release    
2012-01-01    A         0                   
2012-01-02    A         0                   
2012-01-03    A         0                   
2012-01-04    A         1 
2012-01-05    A         0     
2012-01-06    A         0   
2012-01-07    A         0   
2012-01-08    A         0   
2012-01-09    A         0   
2012-01-10    A         0   
2012-01-11    A         0   
2012-01-12    A         0 
2012-01-01    Z         0                   
2012-01-02    Z         1                   
2012-01-03    Z         0                   
2012-01-04    Z         0   
2012-01-05    Z         0     
2012-01-06    Z         0   
2012-01-07    Z         0 

I want to iterate through each row and generate a dayssince column based on how many days it's been since the release.

A few things to keep in mind:
- release = 1 means a new product was released on that date; release = 0 means no release
- the output needs to be computed per product, with one value for each date and product combination

The desired output would be:

    date      product   release    dayssince  
    2012-01-01    A         0          0         
    2012-01-02    A         0          0        
    2012-01-03    A         0          0        
    2012-01-04    A         1          1
    2012-01-05    A         0          2
    2012-01-06    A         0          3
    2012-01-07    A         0          4
    2012-01-08    A         0          5
    2012-01-09    A         0          6
    2012-01-10    A         0          7
    2012-01-11    A         0          8
    2012-01-12    A         0          9
    2012-01-01    Z         0          0        
    2012-01-02    Z         1          1        
    2012-01-03    Z         0          2        
    2012-01-04    Z         0          3
    2012-01-05    Z         0          4
    2012-01-06    Z         0          5
    2012-01-07    Z         0          6

I've tried everything I could think of from ifelse statements and for loops to ddply.

The simplest way I've been able to approach the problem is to do the following conceptually:

x$dayssince <- ifelse(x$release > 0, 1, 0)

- Then check each row of dayssince.
- If dayssince == 1, keep 1.
- If dayssince < 1, check the row above.
- If the row above is > 0, use the value of the row above + 1.
- All of this must be done separately for each product.
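
The steps above can be sketched as a plain loop. This is only a minimal sketch on a small made-up sample in the same shape as the data above (the values are illustrative, not the real data), and it assumes one row per day with no gaps:

```r
# Small illustrative sample: two products, one release each.
x <- data.frame(
  date    = as.Date("2012-01-01") + c(0:5, 0:3),
  product = rep(c("A", "Z"), times = c(6, 4)),
  release = c(0, 0, 1, 0, 0, 0,  0, 1, 0, 0)
)

# Sort so each product's rows are contiguous and in date order.
x <- x[order(x$product, x$date), ]

x$dayssince <- 0
for (i in seq_len(nrow(x))) {
  # TRUE only when the row above belongs to the same product
  same_product <- i > 1 && x$product[i] == x$product[i - 1]
  if (x$release[i] == 1) {
    x$dayssince[i] <- 1                       # a release starts the count at 1
  } else if (same_product && x$dayssince[i - 1] > 0) {
    x$dayssince[i] <- x$dayssince[i - 1] + 1  # continue from the row above
  }
}

x$dayssince
# 0 0 1 2 3 4 0 1 2 3
```

This works but is slow on large data, which is why I'm hoping for something more idiomatic.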

Thank you in advance!


For products that are released multiple times per year, I'm looking to get the number of days since the most recent release.

For example:

    date      product   release    dayssince  
    2012-01-01    A         0          0         
    2012-01-02    A         0          0        
    2012-01-03    A         0          0        
    2012-01-04    A         1          1
    2012-01-05    A         0          2
    2012-01-06    A         0          3
    2012-01-07    A         0          4
    2012-01-08    A         0          5
    2012-01-09    A         0          6
    2012-01-10    A         1          1
    2012-01-11    A         0          2
    2012-01-12    A         0          3
    2012-01-13    A         0          4
    2012-01-14    A         0          5

And so on. Thanks for the flag, @DMC.

You can try using ave from base R:

x$dayssince <- with(x, ave(release, cumsum(release), product,
                           FUN = function(y) cumsum(cumsum(y))))
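
To see why the double cumsum works, here is a small sketch on a single made-up release vector for one product (the vector and the name run_id are just for illustration). The first cumsum(release) labels each run of rows starting at a release. Within a run, cumsum turns 1 0 0 into 1 1 1, and a second cumsum turns that into 1 2 3:

```r
# Illustrative release flags for one product, already in date order.
release <- c(0, 0, 1, 0, 0, 1, 0)

# Each release bumps the counter, so rows sharing a value form one run.
run_id <- cumsum(release)
run_id
# 0 0 1 1 1 2 2

# Inside a single run (here run 1):
y <- release[run_id == 1]   # 1 0 0
cumsum(y)                   # 1 1 1
cumsum(cumsum(y))           # 1 2 3

# Apply the same step to every run at once.
dayssince <- ave(release, run_id, FUN = function(y) cumsum(cumsum(y)))
dayssince
# 0 0 1 2 3 1 2
```

Rows before the first release form run 0, where every value is 0, so they stay at 0.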

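Or using data.table:

library(data.table)
# group by run of rows since the last release and by product, as in the ave call
setDT(x)[, dayssince := cumsum(cumsum(release)),
         by = list(cumsum(release), product)][]
 #           date product release dayssince
 #  1: 2012-01-01       A       0         0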
 #  2: 2012-01-02       A       0         0
 #  3: 2012-01-03       A       0         0
 #  4: 2012-01-04       A       1         1
 #  5: 2012-01-05       A       0         2
 #  6: 2012-01-06       A       0         3
 #  7: 2012-01-07       A       0         4
 #  8: 2012-01-08       A       0         5
 #  9: 2012-01-09       A       0         6
 # 10: 2012-01-10       A       1         1
 # 11: 2012-01-11       A       0         2
 # 12: 2012-01-12       A       0         3
 # 13: 2012-01-01       Z       0         0
 # 14: 2012-01-02       Z       1         1
 # 15: 2012-01-03       Z       0         2
 # 16: 2012-01-04       Z       0         3
 # 17: 2012-01-05       Z       0         4
 # 18: 2012-01-06       Z       0         5
 # 19: 2012-01-07       Z       0         6

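This solution uses dplyr and creates an intermediate variable, release_num, that numbers the releases within each product:

library(dplyr)

x %>%
  group_by(product) %>%
  mutate(release_num = cumsum(release)) %>%    # 0 before the first release, then 1, 2, ...
  group_by(product, release_num) %>%
  mutate(dayssince = cumsum(cumsum(release))) %>%
  ungroup()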

One comment: you ask for a solution that 'iterates row-by-row', which isn't the idiomatic R way of doing things. R works on vectors, typically the columns of a data frame, so a literal row-by-row solution needs a bit of a workaround. You could switch to something like SAS, which does explicitly work row-wise.

My solution uses the plyr library. It loops over the releases rather than being vectorized, so it may be slower than some alternatives.

library(plyr)

# given the row indices of the releases and a zero-filled output vector, produce "dayssince"
ds <- function(rel.dts, x) {
  n <- length(rel.dts)
  if (n == 0) return(x)  # no release for this product: leave all zeros
  for (i in seq_len(n)) {
    # each run ends just before the next release, or at the last row
    end <- if (i < n) rel.dts[i + 1] - 1 else length(x)
    x[rel.dts[i]:end] <- seq_len(end - rel.dts[i] + 1)  # release day counts as 1
  }
  x
}

# use ds() on a given product
ds.prod <- function(dat) {
  dat <- dat[order(dat$date, decreasing = FALSE), ]
  rel.dts <- which(dat$release == 1)
  dat$dayssince <- ds(rel.dts, x = vector("integer", length = nrow(dat)))
  dat
}

# split by product and run
dat <- ddply(dat, .variables = "product", .fun = ds.prod)

If your data is coming from a database, it may be easier to create a view with a computed column used to calculate the days since release.

I am currently too tired to post any SQL code, but if it is an approach you would consider, I can provide some example code tomorrow.
