
Create New Column Based on Previous Row and Multiple Conditions in R

I have the following sample data frame:

x
date          product   release    
2012-01-01    A         0                   
2012-01-02    A         0                   
2012-01-03    A         0                   
2012-01-04    A         1 
2012-01-05    A         0     
2012-01-06    A         0   
2012-01-07    A         0   
2012-01-08    A         0   
2012-01-09    A         0   
2012-01-10    A         0   
2012-01-11    A         0   
2012-01-12    A         0 
2012-01-01    Z         0                   
2012-01-02    Z         1                   
2012-01-03    Z         0                   
2012-01-04    Z         0   
2012-01-05    Z         0     
2012-01-06    Z         0   
2012-01-07    Z         0 

I want to iterate through each row and generate a dayssince column based on how many days it has been since the release.

A few things to keep in mind:
- release = 1 means a new product was released; release = 0 means no product was released
- the output needs to be unique to the date and the product

The desired output would be:

   x
    date      product   release    dayssince  
    2012-01-01    A         0          0         
    2012-01-02    A         0          0        
    2012-01-03    A         0          0        
    2012-01-04    A         1          1
    2012-01-05    A         0          2
    2012-01-06    A         0          3
    2012-01-07    A         0          4
    2012-01-08    A         0          5
    2012-01-09    A         0          6
    2012-01-10    A         0          7
    2012-01-11    A         0          8
    2012-01-12    A         0          9
    2012-01-01    Z         0          0        
    2012-01-02    Z         1          1        
    2012-01-03    Z         0          2        
    2012-01-04    Z         0          3
    2012-01-05    Z         0          4
    2012-01-06    Z         0          5
    2012-01-07    Z         0          6

I've tried everything I could think of, from ifelse statements and for loops to ddply.

The simplest way I've been able to approach the problem is, conceptually, to do the following:

x$dayssince <- ifelse(x$release > 0, 1, 0)

- Then check each row in dayssince.
- If dayssince == 1, then 1
- If dayssince < 1, then check the row above.
- If the row above is > 0, then use the value of the row above + 1
- All of this must be done separately for each product.

Thank you in advance!

UPDATE/CLARIFICATION:

For products that release multiple times per year, I'm looking to get the number of days since the last release.

For example:

    x
    date      product   release    dayssince  
    2012-01-01    A         0          0         
    2012-01-02    A         0          0        
    2012-01-03    A         0          0        
    2012-01-04    A         1          1
    2012-01-05    A         0          2
    2012-01-06    A         0          3
    2012-01-07    A         0          4
    2012-01-08    A         0          5
    2012-01-09    A         0          6
    2012-01-10    A         1          1
    2012-01-11    A         0          2
    2012-01-12    A         0          3
    2012-01-13    A         0          4
    2012-01-14    A         0          5

etc... Thanks for the flag @DMC

You can try using ave from base R:

 x$dayssince <-  with(x, ave(release, cumsum(release), product, 
                          FUN=function(y) cumsum(cumsum(y))))
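
To see why the double cumsum works, here is a minimal self-contained sketch in base R. The data frame `x` is rebuilt from the example in the question (with product A releasing twice, as in the clarification), and the column name `grp` is introduced here for illustration only. It assumes `release` only contains 0 and 1 and that rows are sorted by product and then by date.

```r
# Rebuild the sample data: product A releases on rows 4 and 10, product Z on row 2
x <- data.frame(
  date    = as.Date("2012-01-01") + c(0:11, 0:6),
  product = rep(c("A", "Z"), times = c(12, 7)),
  release = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
              0, 1, 0, 0, 0, 0, 0)
)

# Step 1: a running count of releases labels each "release period".
# It runs over the whole column, so rows must be sorted by product and date.
x$grp <- cumsum(x$release)

# Step 2: within each (period, product) group, the first cumsum turns
# 1,0,0,... into 1,1,1,... and the second turns that into 1,2,3,...
# Rows before a product's first release contain only zeros and stay 0.
x$dayssince <- ave(x$release, x$grp, x$product,
                   FUN = function(y) cumsum(cumsum(y)))

x$dayssince
```

If the rows of different products were interleaved, the global `cumsum(x$release)` would split a product's release period into several groups, so sorting first (or grouping by product before taking the cumulative sum) matters.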

Or using data.table:

library(data.table)
setDT(x)[,dayssince:=cumsum(cumsum(release)) ,
                   .(product,cumsum(release))][]
 #  1: 2012-01-01       A       0         0
 #  2: 2012-01-02       A       0         0
 #  3: 2012-01-03       A       0         0
 #  4: 2012-01-04       A       1         1
 #  5: 2012-01-05       A       0         2
 #  6: 2012-01-06       A       0         3
 #  7: 2012-01-07       A       0         4
 #  8: 2012-01-08       A       0         5
 #  9: 2012-01-09       A       0         6
 # 10: 2012-01-10       A       1         1
 # 11: 2012-01-11       A       0         2
 # 12: 2012-01-12       A       0         3
 # 13: 2012-01-01       Z       0         0
 # 14: 2012-01-02       Z       1         1
 # 15: 2012-01-03       Z       0         2
 # 16: 2012-01-04       Z       0         3
 # 17: 2012-01-05       Z       0         4
 # 18: 2012-01-06       Z       0         5
 # 19: 2012-01-07       Z       0         6

This solution uses dplyr and creates an intermediate variable release_num:

library(dplyr)

x %>%
  group_by(product) %>%
  mutate(release_num = cumsum(release)) %>%
  group_by(product, release_num) %>%
  mutate(dayssince = cumsum(cumsum(release)))
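
Note that the approaches above count rows, which equals days only when every product has exactly one row per calendar day. If dates can have gaps, one possible variant derives the count from the `date` column instead. This is only a sketch under stated assumptions, not part of the original answers: it uses base R, the helper name `days_since_dates` is hypothetical, `date` is assumed to be of class `Date`, and `release` is assumed to contain only 0 and 1.

```r
# Hypothetical helper: calendar days since the most recent release (release day = 1)
days_since_dates <- function(date, release) {
  # Assumes the vectors belong to one product and are sorted by date.
  idx <- cumsum(release)            # 0 before the first release, then 1, 2, ...
  rel_dates <- date[release == 1]   # the release dates, in order
  out <- numeric(length(date))      # rows before the first release stay 0
  has_rel <- idx > 0
  # Days elapsed since the latest release, plus 1 so the release day counts as 1
  out[has_rel] <- as.numeric(date[has_rel] - rel_dates[idx[has_rel]]) + 1
  out
}

# Small example with gaps in the dates (illustrative data, not from the question)
x <- data.frame(
  date    = as.Date(c("2012-01-01", "2012-01-02", "2012-01-05", "2012-01-06",
                      "2012-01-01", "2012-01-03")),
  product = c("A", "A", "A", "A", "Z", "Z"),
  release = c(0, 1, 0, 0, 1, 0)
)
x <- x[order(x$product, x$date), ]

# Apply the helper per product by passing row indices through ave()
x$dayssince <- ave(seq_len(nrow(x)), x$product,
                   FUN = function(i) days_since_dates(x$date[i], x$release[i]))
x$dayssince
```

Passing row indices through `ave()` is one way to run a function of two columns per group without extra packages; the same helper could also be called inside a grouped `mutate()`.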

One comment: you ask for a solution that 'iterates row-by-row.' This isn't the idiomatic R way of doing things. R works on vectors, typically column vectors. Therefore, any row-by-row solution will require a bit of a workaround. You could switch to something like SAS, which does explicitly work row-wise.
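
For comparison, the row-by-row logic described in the question can still be written as an explicit loop in R; it just tends to be more verbose, and often slower on large data, than the vectorized versions. A minimal sketch, assuming rows are sorted by date within each product and `release` contains only 0 and 1 (`days_since_loop` is a hypothetical helper name):

```r
# Hypothetical helper: literal translation of the "check the row above" logic
days_since_loop <- function(release, product) {
  out <- integer(length(release))
  for (i in seq_along(release)) {
    if (release[i] == 1) {
      out[i] <- 1L                   # a release restarts the count at 1
    } else if (i > 1 && product[i] == product[i - 1] && out[i - 1] > 0) {
      out[i] <- out[i - 1] + 1L      # continue counting from the row above
    }                                # otherwise leave 0 (no release yet)
  }
  out
}

# Illustrative input: product A releases twice, product Z once
release <- c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0)
product <- c(rep("A", 7), rep("Z", 3))
days_since_loop(release, product)
# 0 0 1 2 3 1 2 0 1 2
```

The check `product[i] == product[i - 1]` is what keeps the count from carrying over from one product to the next.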

My solution uses the plyr library, although it's not vectorized. It may therefore be slower than some alternatives.

library(plyr)

# Given the row indices of the releases and the number of rows, produce "dayssince":
# 0 before the first release, then 1, 2, 3, ... restarting at 1 on each release row.
ds <- function(rel.idx, n.rows) {
  out <- integer(n.rows)
  if (length(rel.idx) == 0) return(out)   # no release at all: every row stays 0
  ends <- c(rel.idx[-1] - 1, n.rows)      # last row of each release period
  for (i in seq_along(rel.idx)) {
    out[rel.idx[i]:ends[i]] <- seq_len(ends[i] - rel.idx[i] + 1)
  }
  out
}

# use ds() on a single product
ds.prod <- function(dat) {
  dat <- dat[order(dat$date), ]           # make sure rows are in date order
  rel.idx <- which(dat$release == 1)
  dat$dayssince <- ds(rel.idx, nrow(dat))
  dat
}

# split by product and run (x is the data frame from the question)
result <- ddply(x, .variables = "product", .fun = ds.prod)

If your data is coming from a database, it may be easier to create a view with a computed column that calculates the days since release.

I am currently too tired to post any SQL code, but if it is an approach you would consider, I can provide some example code tomorrow.
