Create New Column Based on Previous Row and Multiple Conditions in R
I have the following sample data frame:
x
date product release
2012-01-01 A 0
2012-01-02 A 0
2012-01-03 A 0
2012-01-04 A 1
2012-01-05 A 0
2012-01-06 A 0
2012-01-07 A 0
2012-01-08 A 0
2012-01-09 A 0
2012-01-10 A 0
2012-01-11 A 0
2012-01-12 A 0
2012-01-01 Z 0
2012-01-02 Z 1
2012-01-03 Z 0
2012-01-04 Z 0
2012-01-05 Z 0
2012-01-06 Z 0
2012-01-07 Z 0
I want to iterate through each row and generate a dayssince column based on how many days it's been since the release.
A few things to keep in mind:
- new product released = 1, no product released = 0
- the output needs to be unique to the date and the product
The desired output would be:
x
date product release dayssince
2012-01-01 A 0 0
2012-01-02 A 0 0
2012-01-03 A 0 0
2012-01-04 A 1 1
2012-01-05 A 0 2
2012-01-06 A 0 3
2012-01-07 A 0 4
2012-01-08 A 0 5
2012-01-09 A 0 6
2012-01-10 A 0 7
2012-01-11 A 0 8
2012-01-12 A 0 9
2012-01-01 Z 0 0
2012-01-02 Z 1 1
2012-01-03 Z 0 2
2012-01-04 Z 0 3
2012-01-05 Z 0 4
2012-01-06 Z 0 5
2012-01-07 Z 0 6
I've tried everything I could think of, from ifelse statements and for loops to ddply.
The simplest way I've been able to approach the problem conceptually is the following:
x$dayssince <- ifelse(x$release > 0, 1, 0)
- Then check each row in dayssince.
- If dayssince == 1, then 1
- If dayssince < 1, then check the row above.
- If the row above is > 0, then use the value of the row above + 1
- All of this is unique to the product.
Thank you in advance!
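The steps above can be sketched as a plain loop. This is only an illustration of the logic as described, not a recommended approach. It assumes the rows are already sorted by product and then by date, with one row per day, and it rebuilds a small version of the sample data so that it runs on its own.

```r
# Rebuild the sample data (release on row 4 for A and on row 2 for Z)
x <- data.frame(
  date    = as.Date("2012-01-01") + c(0:11, 0:6),
  product = rep(c("A", "Z"), times = c(12, 7)),
  release = 0L
)
x$release[c(4, 14)] <- 1L

# Step 1: mark release days with 1, everything else with 0
x$dayssince <- ifelse(x$release > 0, 1, 0)

# Steps 2-4: walk down the rows; if the row above belongs to the same
# product and already has a positive count, continue counting from it
for (i in seq_len(nrow(x))[-1]) {
  same_product <- x$product[i] == x$product[i - 1]
  if (x$dayssince[i] < 1 && same_product && x$dayssince[i - 1] > 0) {
    x$dayssince[i] <- x$dayssince[i - 1] + 1
  }
}
```

Because a release row is set to 1 before the loop starts, the count restarts on its own whenever a product has more than one release.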
For the same products that release multiple times per year, I'm looking to get the number of days since the last release.
For example:
x
date product release dayssince
2012-01-01 A 0 0
2012-01-02 A 0 0
2012-01-03 A 0 0
2012-01-04 A 1 1
2012-01-05 A 0 2
2012-01-06 A 0 3
2012-01-07 A 0 4
2012-01-08 A 0 5
2012-01-09 A 0 6
2012-01-10 A 1 1
2012-01-11 A 0 2
2012-01-12 A 0 3
2012-01-13 A 0 4
2012-01-14 A 0 5
etc... Thanks for the flag @DMC
You can try using ave from base R:
x$dayssince <- with(x, ave(release, cumsum(release), product,
FUN=function(y) cumsum(cumsum(y))))
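To see why the double cumsum works, here is a small worked sketch on a single product's release vector (the values are product A from the multi-release example in the question; the names grp and inner are only for illustration). The outer grouping by cumsum(release) starts a new block at each release. Inside a block, the first cumsum turns the leading 1 into a run of 1s, and the second cumsum counts along that run.

```r
release <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0)

# Block id: increases by one at every release
grp <- cumsum(release)
# 0 0 0 1 1 1 1 1 1 2 2 2

# First cumsum within each block: 0s before any release, 1s afterwards
inner <- ave(release, grp, FUN = cumsum)
# 0 0 0 1 1 1 1 1 1 1 1 1

# Second cumsum within each block: running count since the release
dayssince <- ave(release, grp, FUN = function(y) cumsum(cumsum(y)))
# 0 0 0 1 2 3 4 5 6 1 2 3
```

Adding product as a second grouping variable, as in the answer, keeps the blocks of different products apart.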
Or using data.table:
library(data.table)
setDT(x)[,dayssince:=cumsum(cumsum(release)) ,
.(product,cumsum(release))][]
# 1: 2012-01-01 A 0 0
# 2: 2012-01-02 A 0 0
# 3: 2012-01-03 A 0 0
# 4: 2012-01-04 A 1 1
# 5: 2012-01-05 A 0 2
# 6: 2012-01-06 A 0 3
# 7: 2012-01-07 A 0 4
# 8: 2012-01-08 A 0 5
# 9: 2012-01-09 A 0 6
# 10: 2012-01-10 A 1 1
# 11: 2012-01-11 A 0 2
# 12: 2012-01-12 A 0 3
# 13: 2012-01-01 Z 0 0
# 14: 2012-01-02 Z 1 1
# 15: 2012-01-03 Z 0 2
# 16: 2012-01-04 Z 0 3
# 17: 2012-01-05 Z 0 4
# 18: 2012-01-06 Z 0 5
# 19: 2012-01-07 Z 0 6
This solution uses dplyr and creates an intermediate variable release_num:
library(dplyr)
x %>%
group_by(product) %>%
mutate(release_num = cumsum(release)) %>%
group_by(product, release_num) %>%
mutate(dayssince = cumsum(cumsum(release)))
One comment: you ask for a solution that 'iterates row-by-row.' This isn't the idiomatic R way of doing things. R works on vectors, typically column vectors, so any row-by-row solution will require a bit of a workaround. You could switch to something like SAS, which does explicitly work row-wise.
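As an illustration of the column-wise style, here is a base R sketch that avoids an explicit row loop and works from the dates themselves, so it should also cope with missing days. The helper name days_since is hypothetical, and the sample data is rebuilt so that the block runs on its own. The release day is counted as day 1, matching the desired output in the question.

```r
x <- data.frame(
  date    = as.Date("2012-01-01") + c(0:11, 0:6),
  product = rep(c("A", "Z"), times = c(12, 7)),
  release = 0L
)
x$release[c(4, 10, 14)] <- 1L

# Hypothetical helper: days since the most recent release for ONE product,
# given its dates in ascending order
days_since <- function(date, release) {
  # Row position of the latest release at or before each row (0 if none yet)
  last <- cummax(ifelse(release == 1, seq_along(release), 0L))
  out <- integer(length(release))
  has <- last > 0
  out[has] <- as.integer(date[has] - date[last[has]]) + 1L
  out
}

# Sort so that the per-product results line up with the row order
x <- x[order(x$product, x$date), ]
x$dayssince <- unlist(
  lapply(split(x, x$product), function(d) days_since(d$date, d$release)),
  use.names = FALSE
)
```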
My solution uses the plyr library, although it's not vectorized. It may therefore be slower than some alternatives.
# given the row indices of the releases and a zero-filled output vector,
# produce "dayssince" (counts rows, so it assumes one row per day)
ds <- function(rel.dts, x) {
  n <- length(rel.dts)
  if (n == 0) return(x)  # no release for this product: all zeros
  # last row covered by each release: the row before the next release,
  # or the final row for the last release
  ends <- c(rel.dts[-1] - 1, length(x))
  for (i in seq_len(n)) {
    # the release day itself counts as day 1, as in the desired output
    x[rel.dts[i]:ends[i]] <- seq_len(ends[i] - rel.dts[i] + 1)
  }
  return(x)
}
# use ds() on a given product
ds.prod <- function(dat) {
  dat <- dat[order(dat$date, decreasing=FALSE),]
  rel.dts <- which(dat$release == 1)
  dat$dayssince <- ds(rel.dts, x=vector("integer", length=nrow(dat)))
  return(dat)
}
# split by product and run
library(plyr)
x <- ddply(x, .variables="product", .fun=ds.prod)
If your data is coming from a database, it may be easier to create a view with a computed column used to calculate the days since release.
I am currently too tired to post any SQL code, but if it is an approach you would consider, I can provide some example code tomorrow.