Set values to NA in data frame if Date is outside of a given interval
I have two data frames, df1 and df2. df1 contains values for different products (X1, X2, and so on) at different times. df2 contains the true start and end dates for some of the products. I want to replace the values in df1 that fall outside the date intervals given in df2 with NA, as shown in the final table df3.
Create df1 and df2:
df1=data.frame(matrix(NA,10,6))
df1[,1]=(c(seq(as.Date("2012-01-01"),as.Date("2012-10-01"),by="1 month")))
df1[,2]=c(1:10); df1[,3]=c(12:21); df1[,4]=c(0.5:10); df1[,5]=c(5:14); df1[,6]=c(10:19)
colnames(df1)=c("Date","X1","X2","X3","X4","X5")
df2=data.frame(matrix(data=c("X1","X2","X4","2012-02-01","2012-04-01","2012-06-01","2012-09-01","2012-06-01","2012-10-01"),3,3))
colnames(df2)=c("Name","Start","End")
Output:
> df1
Date X1 X2 X3 X4 X5
1 2012-01-01 1 12 0.5 5 10
2 2012-02-01 2 13 1.5 6 11
3 2012-03-01 3 14 2.5 7 12
4 2012-04-01 4 15 3.5 8 13
5 2012-05-01 5 16 4.5 9 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 18 6.5 11 16
8 2012-08-01 8 19 7.5 12 17
9 2012-09-01 9 20 8.5 13 18
10 2012-10-01 10 21 9.5 14 19
> df2
Name Start End
1 X1 2012-02-01 2012-09-01
2 X2 2012-04-01 2012-06-01
3 X4 2012-06-01 2012-10-01
Final output should look like this:
df3
Date X1 X2 X3 X4 X5
1 2012-01-01 NA NA 0.5 NA 10
2 2012-02-01 2 NA 1.5 NA 11
3 2012-03-01 3 NA 2.5 NA 12
4 2012-04-01 4 15 3.5 NA 13
5 2012-05-01 5 16 4.5 NA 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 NA 6.5 11 16
8 2012-08-01 8 NA 7.5 12 17
9 2012-09-01 9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19
I am sure there is a more elegant way, but you could create a matrix of indices that meet your criterion, with an element set to 1 where the date is within the interval for that product and NA where it isn't. Assuming you are dealing with numerical values, you can then multiply your data frame by that index matrix.
Example:
library(dplyr)
## Convert your dates to Date objects:
df2 <- df2 %>% dplyr::mutate(Start = as.Date(Start), End = as.Date(End))
## Create a logical matrix: TRUE where the date lies inside the product's interval.
## Products without a row in df2 are kept entirely (all TRUE).
indMx <- lapply(names(df1)[-1], function(product){
  hit <- df2$Name == product
  if (!any(hit)) return(rep(TRUE, nrow(df1)))
  (df1$Date >= df2$Start[hit]) & (df1$Date <= df2$End[hit])
}) %>% do.call('cbind',.)
## NA^0 is 1 and NA^1 is NA, so NA^(!indMx) gives 1 in place of TRUE
## and NA in place of FALSE:
df1[,-1] <- df1[,-1]*NA^(!indMx)
df1
## Result with the three-product data (X1, X2, X3, each with an interval in df2):
# Date X1 X2 X3
# 1 2012-01-01 NA NA NA
# 2 2012-02-01 2 NA NA
# 3 2012-03-01 3 NA NA
# 4 2012-04-01 4 15 NA
# 5 2012-05-01 5 16 NA
# 6 2012-06-01 6 17 5.5
# 7 2012-07-01 7 NA 6.5
# 8 2012-08-01 8 NA 7.5
# 9 2012-09-01 9 NA 8.5
# 10 2012-10-01 NA NA 9.5
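The multiplication trick relies on how R evaluates powers of NA: NA^0 is 1, while NA^1 is NA. A minimal sketch on made-up vectors (the names inside, values, mask and masked are illustrative and not from the original post) shows which orientation keeps the in-interval values:

```r
# Illustrative data: which entries lie inside the interval, and their values.
inside <- c(TRUE, FALSE, TRUE)
values <- c(10, 20, 30)

# Logicals are coerced to 0/1, so NA^(!inside) is
# NA^0 = 1 where inside is TRUE and NA^1 = NA where inside is FALSE.
mask <- NA^(!inside)

# Multiplying keeps in-interval values and blanks out the rest.
masked <- values * mask
print(masked)
```

Using NA^inside instead would do the opposite and blank out the values that lie inside the interval.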
Here is one solution with data.table. There might be a more elegant method using non-equi joins.
for(i in seq_len(nrow(df2))) df1[!(Date %between% df2[i,.(Start, End)]), df2[i, Name] := NA]
Here, you loop over each row of df2, subset df1 to the rows whose dates fall outside the start and end dates in the current row of df2, and then assign NA to the variable named in that row.
This returns
df1
Date X1 X2 X3
1: 2012-01-01 NA NA NA
2: 2012-02-01 2 NA NA
3: 2012-03-01 3 NA NA
4: 2012-04-01 4 15 NA
5: 2012-05-01 5 16 NA
6: 2012-06-01 6 17 5.5
7: 2012-07-01 7 NA 6.5
8: 2012-08-01 8 NA 7.5
9: 2012-09-01 9 NA 8.5
10: 2012-10-01 NA NA 9.5
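The non-equi join mentioned above might look roughly like the following. This is only a sketch under assumptions: the toy data is rebuilt inline so the block runs on its own, X3 is deliberately left without an interval, and the helper names long and keep are made up for illustration.

```r
library(data.table)

# Toy data mirroring the question; X3 has no interval in df2 (assumption).
df1 <- data.table(
  Date = seq(as.Date("2012-01-01"), as.Date("2012-10-01"), by = "1 month"),
  X1 = as.numeric(1:10),
  X2 = as.numeric(12:21),
  X3 = seq(0.5, 9.5, by = 1)
)
df2 <- data.table(
  Name  = c("X1", "X2"),
  Start = as.Date(c("2012-02-01", "2012-04-01")),
  End   = as.Date(c("2012-09-01", "2012-06-01"))
)

# Long form: one row per (Date, product).
long <- melt(df1, id.vars = "Date", variable.name = "Name",
             value.name = "value", variable.factor = FALSE)

# Products without an interval are kept; all others start as "not kept".
long[, keep := !(Name %in% df2$Name)]

# Non-equi update join: flag rows whose Date lies inside [Start, End].
long[df2, on = .(Name, Date >= Start, Date <= End), keep := TRUE]

# Blank out everything that was not flagged, then go back to wide form.
long[keep == FALSE, value := NA_real_]
df3 <- dcast(long, Date ~ Name, value.var = "value")
print(df3)
```

The trade-off is a reshape to long form and back in exchange for dropping the explicit loop over the rows of df2.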
Update
If the data is constructed as in the updated original post, first run the lines below to convert both data frames to data.tables and the Name variable in df2 to a character vector (it starts out as a factor in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE). Then the above code will work for the new dataset.
# convert data.frames to data.tables
setDT(df1)
setDT(df2)
# convert factor to character
df2[, Name := as.character(Name)]
Data
library(data.table)
# read in data
df1 <- fread("Date X1 X2 X3
2012-01-01 1 12 0.5
2012-02-01 2 13 1.5
2012-03-01 3 14 2.5
2012-04-01 4 15 3.5
2012-05-01 5 16 4.5
2012-06-01 6 17 5.5
2012-07-01 7 18 6.5
2012-08-01 8 19 7.5
2012-09-01 9 20 8.5
2012-10-01 10 21 9.5")
df2 <- fread(" Name Start End
X1 2012-02-01 2012-09-01
X2 2012-04-01 2012-06-01
X3 2012-06-01 2012-10-01")
# convert to date type
df1[, Date := as.Date(Date)]
df2[, c("Start", "End") := .(as.Date(Start), as.Date(End))]
Using dplyr and tidyr:
library(tidyr)
library(dplyr)
df3 <- df1 %>% gather(key=Name,value=value,-Date) %>% #convert to long form
left_join(df2) %>% #merge in date limits
mutate(ind=(as.Date(Date)>=as.Date(Start) & as.Date(Date)<=as.Date(End))) %>% #check valid
mutate(value=replace(value,!ind,NA)) %>% #replace invalid with NA
select(Date,Name,value) %>% #remove unnecessary variables
spread(key=Name,value=value) #convert back to rectangular form
df3
Date X1 X2 X3 X4 X5
1 2012-01-01 NA NA 0.5 NA 10
2 2012-02-01 2 NA 1.5 NA 11
3 2012-03-01 3 NA 2.5 NA 12
4 2012-04-01 4 15 3.5 NA 13
5 2012-05-01 5 16 4.5 NA 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 NA 6.5 11 16
8 2012-08-01 8 NA 7.5 12 17
9 2012-09-01 9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19
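In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A hedged sketch of the same pipeline with the newer functions follows; the toy data is rebuilt inline as an assumption so that the block is self-contained, with X3 left without an interval.

```r
library(dplyr)
library(tidyr)

# Toy data mirroring the question; X3 has no interval in df2 (assumption).
df1 <- data.frame(
  Date = seq(as.Date("2012-01-01"), as.Date("2012-10-01"), by = "1 month"),
  X1 = as.numeric(1:10),
  X2 = as.numeric(12:21),
  X3 = seq(0.5, 9.5, by = 1)
)
df2 <- data.frame(
  Name  = c("X1", "X2"),
  Start = as.Date(c("2012-02-01", "2012-04-01")),
  End   = as.Date(c("2012-09-01", "2012-06-01")),
  stringsAsFactors = FALSE
)

df3 <- df1 %>%
  pivot_longer(-Date, names_to = "Name", values_to = "value") %>%  # long form
  left_join(df2, by = "Name") %>%                                  # add limits
  # keep a value if its product has no interval or the date is inside it
  mutate(value = if_else(is.na(Start) | (Date >= Start & Date <= End),
                         value, NA_real_)) %>%
  select(Date, Name, value) %>%
  pivot_wider(names_from = Name, values_from = value)              # wide form
print(df3)
```

Here the is.na(Start) check makes the handling of products without an interval explicit instead of relying on how replace() treats NA indices.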