I need to adapt some code, which works perfectly with a differently structured data frame, so that it selects a 2-day time window from the column Day. In particular, I am interested in the day before day 0 (i.e. i - 1 and i, where i is the day of interest): the values in the column Count for day i - 1 have to be added to the Count of day 0 (i).
Here is an example of my data frame:
df <- read.table(text = "
Station Day Count
1 33012 12448 4
2 35004 12448 4
3 35008 12448 4
4 37006 12448 4
5 21009 4835 3
6 24005 4835 3
7 27001 4835 3
8 25005 12447 3
9 29001 12447 3
10 29002 12447 3
11 29002 12446 3
12 30001 12446 3
13 31002 12446 3
14 47007 4834 2
15 49002 4834 2
16 47004 12445 1
17 51001 12449 1
18 51003 4832 1
19 52004 4836 1", header = TRUE)
My output should be:
Station Day Count
1 33012 12448 7
2 35004 12448 7
3 35008 12448 7
4 37006 12448 7
5 21009 4835 5
6 24005 4835 5
7 27001 4835 5
8 29002 12446 4
9 30001 12446 4
10 31002 12446 4
11 51001 12449 1
12 51003 4832 1
13 52004 4836 1
14 25005 12447 0
15 29001 12447 0
16 29002 12447 0
17 47007 4834 0
18 49002 4834 0
19 47004 12445 0
I am trying this code, but it doesn't work with my real dataframe:
for (i in unique(df$Day)) {
  temp <- df$Count[df$Day == i]
  if(length(temp > 0)) {
    condition1 <- df$Day == i - 1
    if (any(condition1)) {
      df$Count[df$Day == i] <- mean(df$Count[condition1]) + df$Count[df$Day == i]
      df$Count[condition1] <- 0
    }
  }
}
The code seems right and makes sense, but my output is not what I expect.
Can anyone help me?
@aichao's code works well.
If I want to consider the previous 30 days (i.e. day-30, day-29, day-28, ..., day-1, day0), is there a quick way to do it instead of writing 30 if statements (conditions)?
Thanks again @aichao for your help.
The following does what you want on the sample data you gave:
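for (i in unique(df$Day)) {
  temp <- df$Count[df$Day == i]
  # Skip days whose counts have already been zeroed out
  if (any(temp > 0)) {
    # Rows belonging to the previous day
    condition1 <- df$Day == i - 1
    # Ignore previous-day rows that sit above the last row of day i
    condition1[which(df$Day == i - 1) < max(which(df$Day == i))] <- FALSE
    if (any(condition1)) {
      # Add the previous day's mean count to day i, then zero the previous day
      df$Count[df$Day == i] <- mean(df$Count[condition1]) + df$Count[df$Day == i]
      df$Count[condition1] <- 0
    }
  }
}
print(df[order(df$Count, decreasing = TRUE),])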
## Station Day Count
##1 33012 12448 7
##2 35004 12448 7
##3 35008 12448 7
##4 37006 12448 7
##5 21009 4835 5
##6 24005 4835 5
##7 27001 4835 5
##11 29002 12446 4
##12 30001 12446 4
##13 31002 12446 4
##17 51001 12449 1
##18 51003 4832 1
##19 52004 4836 1
##8 25005 12447 0
##9 29001 12447 0
##10 29002 12447 0
##14 47007 4834 0
##15 49002 4834 0
##16 47004 12445 0
A key requirement gleaned from your comment, and missing from your implementation, is that only rows further down the data frame are considered when determining the previous day and its count. That is, you are processing the data frame rows as if they were ordered in time, rather than treating the values in the Day column as the time ordering. Therefore, for df$Day = 12449 there is no previous day to consider, since all rows with df$Day = 12448 precede it. As a result, the Count for df$Day = 12449 remains at 1 and, more importantly, the Counts for all rows that have df$Day = 12448 are not zeroed out after processing df$Day = 12449.
To implement this, we need to further filter condition1 so that we set to FALSE all rows for which df$Day == i - 1 (previous day) that precede the last (highest-index) row for which df$Day == i (day of interest), using the line
condition1[which(df$Day == i - 1) < max(which(df$Day == i))] <- FALSE
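To make the effect of this filter concrete, here is a small sketch on a made-up Day vector. The values and the helper name previous.day.rows are illustrative assumptions, not part of the original code, and equal days are assumed to sit in contiguous blocks as in the sample data.

```r
# Illustrative Day column: 12448 in rows 1-2, 12447 in rows 3-4, 12449 in row 5.
Day <- c(12448, 12448, 12447, 12447, 12449)

# Hypothetical helper: returns the filtered condition1 for day i.
previous.day.rows <- function(Day, i) {
  condition1 <- Day == i - 1
  # Same filter line as in the loop above
  condition1[which(Day == i - 1) < max(which(Day == i))] <- FALSE
  condition1
}

# Day 12448: the previous day (12447) sits below it, so rows 3-4 are kept.
kept.12448 <- previous.day.rows(Day, 12448)
# Day 12449: the previous day (12448) sits above it, so no rows are kept.
kept.12449 <- previous.day.rows(Day, 12449)

print(which(kept.12448))
print(which(kept.12449))
```

Note that the index used in the filter is a logical vector with one element per previous-day row, which R recycles along condition1. With contiguous blocks that vector is either all TRUE or all FALSE, so the result is as intended.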
Note that this solution assumes that equal values in the Day column are lumped together as blocks of rows, as they are in your sample data. Otherwise, your for loop over unique(df$Day) needs to be reconsidered completely and replaced with a loop over rows, in order to track the current row for the day of interest in the data frame.
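One possible way to check the block assumption before running the loop is sketched below. The helper name is.blocked is hypothetical; it relies on rle, which collapses consecutive equal values into runs.

```r
# TRUE if every distinct value of x occupies a single contiguous run of rows.
# (is.blocked is a hypothetical helper name, not part of the original code.)
is.blocked <- function(x) {
  runs <- rle(x)$values      # one entry per run of consecutive equal values
  !any(duplicated(runs))     # a repeated run value means a day is split up
}

blocked   <- is.blocked(c(12448, 12448, 4835, 4835, 12447))
scattered <- is.blocked(c(12448, 4835, 12448, 12447))
print(c(blocked, scattered))
```

On your real data you would call is.blocked(df$Day); if it returns FALSE, the loop over unique(df$Day) should not be trusted as written.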
In addition, a minor bug in your code was in the line
if(length(temp > 0)) {
The intent was to check whether there are any rows for which the Count is greater than 0 for the day of interest. However, comparison operators in R are vectorized, so temp > 0 returns a logical vector of the same length as its input temp. Therefore, length(temp > 0) will always return a positive number unless temp itself is of length 0 (i.e., empty). To get what you intend, the line is changed to
if(any(temp > 0)) {
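The difference between the two conditions can be seen on a day whose counts have all been zeroed; the vector below is made up for illustration.

```r
temp <- c(0, 0, 0)            # a day whose counts are all zero

flags <- temp > 0             # one logical per element: FALSE FALSE FALSE
n     <- length(flags)        # 3, which if() treats as TRUE
a     <- any(flags)           # FALSE, which is the intended answer

print(n)
print(a)
```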
Update: new requirement regarding multiple previous days
The simplest way to address the new requirement is to put the body of code within the if (any(temp > 0)) {...} block into a function, call it accumulate.mean.count, and apply this function over a collection of previous days using sapply. The modifications are:
accumulate.mean.count <- function(this.day, lag) {
  # Rows belonging to the day that is `lag` days before this.day
  condition1 <- df$Day == this.day - lag
  # Ignore lagged-day rows that sit above the last row of this.day
  condition1[which(df$Day == this.day - lag) < max(which(df$Day == this.day))] <- FALSE
  if (any(condition1)) {
    # <<- modifies the df defined outside the function
    df$Count[df$Day == this.day] <<- mean(df$Count[condition1]) + df$Count[df$Day == this.day]
    df$Count[condition1] <<- 0
  }
}
lags <- seq_len(30)
for (i in unique(df$Day)) {
  temp <- df$Count[df$Day == i]
  if (any(temp > 0)) {
    # Called for its side effect on df; the return value is ignored
    sapply(lags, accumulate.mean.count, this.day=i)
  }
}
print(df[order(df$Count, decreasing = TRUE),])
Notes:
lag is the number of days previous to (i.e., that lag) the current day. A lag = 1 means the previous day, and a lag = 2 means two days previous, etc. lags is a collection of these. Here, lags <- seq_len(30) is a sequence from 1 to 30 over which accumulate.mean.count is applied, which is what you want. See this for an excellent overview on the *apply family of R functions. Note that lags need not be a sequence but just a collection of integers such as c(1, 5, 10) for the previous day, 5 days previous and 10 days previous. It does not even have to be positive if you want to roll in future days, but should not be zero.
Because of the lexical scoping rule of R, setting df$Count, which is a variable outside the scope of accumulate.mean.count, within the function accumulate.mean.count requires <<- instead of <-. See this for an explanation and note the dangers of using <<- mentioned there.
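If you would rather avoid <<-, one alternative is to pass the data frame in and return the modified copy. The sketch below is only one way to do that: the function name accumulate.counts is hypothetical, the logic is meant to mirror the loop above, and it keeps the same assumption that equal Day values form contiguous blocks.

```r
# Sample data from the question, rebuilt so the example is self-contained.
df <- data.frame(
  Station = c(33012, 35004, 35008, 37006, 21009, 24005, 27001, 25005, 29001,
              29002, 29002, 30001, 31002, 47007, 49002, 47004, 51001, 51003, 52004),
  Day = c(rep(12448, 4), rep(4835, 3), rep(12447, 3), rep(12446, 3),
          rep(4834, 2), 12445, 12449, 4832, 4836),
  Count = c(rep(4, 4), rep(3, 9), rep(2, 2), rep(1, 4))
)

# Hypothetical helper: works on a local copy of df and returns it,
# so no global variable is modified.
accumulate.counts <- function(df, lags) {
  for (i in unique(df$Day)) {
    this.rows <- which(df$Day == i)
    # Skip days whose counts have already been zeroed out
    if (!any(df$Count[this.rows] > 0)) next
    for (lag in lags) {
      condition1 <- df$Day == i - lag
      # Ignore lagged-day rows that sit above the last row of day i
      condition1[which(df$Day == i - lag) < max(this.rows)] <- FALSE
      if (any(condition1)) {
        df$Count[this.rows] <- mean(df$Count[condition1]) + df$Count[this.rows]
        df$Count[condition1] <- 0
      }
    }
  }
  df
}

res1 <- accumulate.counts(df, seq_len(1))
res2 <- accumulate.counts(df, seq_len(2))
print(res2[order(res2$Count, decreasing = TRUE), ])
```

Because the function returns a new data frame, the original df is left untouched and different values of lags can be compared side by side.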
I do not have enough data to test lags <- seq_len(30), but for seq_len(1) I recovered the original result, and for seq_len(2) I got
## Station Day Count
##1 33012 12448 10
##2 35004 12448 10
##3 35008 12448 10
##4 37006 12448 10
##5 21009 4835 5
##6 24005 4835 5
##7 27001 4835 5
##16 47004 12445 1
##17 51001 12449 1
##18 51003 4832 1
##19 52004 4836 1
##8 25005 12447 0
##9 29001 12447 0
##10 29002 12447 0
##11 29002 12446 0
##12 30001 12446 0
##13 31002 12446 0
##14 47007 4834 0
##15 49002 4834 0
which I believe is what you would want.