How to get the value if a date from one dataset is within a period of time in another dataset for each id in R?
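Suppose I have two datasets, A and B. Dataset A has the columns ID, date, and Interest. Dataset B has the columns ID, date_1, date_2, and Int. If the date in A falls between date_1 and date_2 in B for the same ID, I want to copy the value of Int from B into the Interest column of A. Here is the sample code I ran, but I got this error message:
"Error in if (subset_A[j, ]$date >= subset_B[k, ]$date_1 & subset_A[j, :
argument is of length zero"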
A <- data.frame("ID" = c(1,1,1,2,2,3), "date" = c("1900-01-01","1900-11-01","1902-01-01","1903-01-01","1905-01-01","1900-01-01"), "Interest" = c(NA,NA,NA,NA,NA,NA), stringsAsFactors = FALSE)
A$date<-as.Date(A$date)
B <- data.frame("ID" = c(1,1,2,2,2,5),
"date_1" = c("1900-01-01","1900-02-01","1900-01-01","1901-02-01","1901-03-01","1900-01-01"),
"date_2" = c("1900-01-03","1903-01-01","1901-01-01","1901-03-01","1904-03-01","1903-01-01"),
"Int" = c(1,2,1,3,3,1))
B$date_1 <- as.Date(B$date_1)
B$date_2 <- as.Date(B$date_2)
In R:
IDlist = unique(A$ID)
Table=NULL
for (i in 1:length(IDlist)){
subset_B <-subset(B, ID == IDlist[i])
subset_A <-subset(A, ID == IDlist[i])
for (j in 1:nrow(subset_A)){
for (k in 1:nrow(subset_B)){
if(subset_A[j,]$date >= subset_B[k,]$date_1&
subset_A[j,]$date <= subset_B[k,]$date_2&
!is.na(subset_B[k,]$date_1) &
!is.na(subset_B[k,]$date_2))
subset_A[j,]$Interest <- subset_B[k,]$Int
Table=rbind(Table,
subset_A)
}
}
}
I want to end up with data frame A whose last column is c(1,2,2,3,NA,NA). I am not sure why the for loop does not work. Thanks!
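A likely cause of the error, given the sample data: ID 3 appears in A but not in B, so subset_B has zero rows and 1:nrow(subset_B) evaluates to c(1, 0) rather than an empty sequence. At k = 1 the selected row is all NA and the !is.na() checks make the condition FALSE, but at k = 0 every comparison has length zero, which is what if() complains about. In addition, the if has no braces, so the rbind() call runs on every pass of the innermost loop and duplicates rows. The following is a minimal sketch of a repaired loop, not a definitive fix; the names result and cand are introduced here for illustration, and the sketch keeps the original behaviour that a later matching interval overwrites an earlier one.

```r
# Sample data from the question; Interest starts as a double NA
A <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3),
  date = as.Date(c("1900-01-01", "1900-11-01", "1902-01-01",
                   "1903-01-01", "1905-01-01", "1900-01-01")),
  Interest = NA_real_
)
B <- data.frame(
  ID = c(1, 1, 2, 2, 2, 5),
  date_1 = as.Date(c("1900-01-01", "1900-02-01", "1900-01-01",
                     "1901-02-01", "1901-03-01", "1900-01-01")),
  date_2 = as.Date(c("1900-01-03", "1903-01-01", "1901-01-01",
                     "1901-03-01", "1904-03-01", "1903-01-01")),
  Int = c(1, 2, 1, 3, 3, 1)
)

result <- A  # hypothetical name for the output table
for (j in seq_len(nrow(A))) {
  # Intervals in B that belong to the same ID (may have zero rows)
  cand <- B[B$ID == A$ID[j], ]
  # seq_len(0) is empty, so the inner loop is skipped when there is no match
  for (k in seq_len(nrow(cand))) {
    if (!is.na(cand$date_1[k]) && !is.na(cand$date_2[k]) &&
        A$date[j] >= cand$date_1[k] && A$date[j] <= cand$date_2[k]) {
      result$Interest[j] <- cand$Int[k]
    }
  }
}
result
```

Using seq_len() instead of 1:n is the usual guard against the zero-row case, and && keeps the condition a single TRUE or FALSE.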
Using data.table's non-equi join and update in a join, this becomes
library(data.table)
setDT(A)[, Interest := NULL][
setDT(B), on = .(ID, date >= date_1, date <= date_2), Interest := Int][]
   ID       date Interest
1:  1 1900-01-01        1
2:  1 1900-11-01        2
3:  1 1902-01-01        2
4:  2 1903-01-01        3
5:  2 1905-01-01       NA
6:  3 1900-01-01       NA
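Note that the Interest column must be removed from A before the update join, because it was initialized with NA of type logical while the replacement values are of type double, and a vector column can hold only one type of data.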
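An alternative to dropping the column, sketched here under the assumption that A can be built from scratch: initialize Interest with NA_real_ so that it already has type double, and the update join can then assign into it directly. This is an illustrative variant, not the answer's own code.

```r
library(data.table)

# Same sample data as in the question, but Interest is a double NA from the start
A <- data.table(
  ID = c(1, 1, 1, 2, 2, 3),
  date = as.Date(c("1900-01-01", "1900-11-01", "1902-01-01",
                   "1903-01-01", "1905-01-01", "1900-01-01")),
  Interest = NA_real_
)
B <- data.table(
  ID = c(1, 1, 2, 2, 2, 5),
  date_1 = as.Date(c("1900-01-01", "1900-02-01", "1900-01-01",
                     "1901-02-01", "1901-03-01", "1900-01-01")),
  date_2 = as.Date(c("1900-01-03", "1903-01-01", "1901-01-01",
                     "1901-03-01", "1904-03-01", "1903-01-01")),
  Int = c(1, 2, 1, 3, 3, 1)
)

# Non-equi update join: i.Int is the Int column of B (the i argument);
# rows of A without a matching interval keep their NA
A[B, on = .(ID, date >= date_1, date <= date_2), Interest := i.Int]
A[]
```

If several intervals in B matched the same row of A, the value from the last matching row of B would be the one kept; in the sample data each date matches at most one interval.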
1) With SQL this can be expressed directly:
library(sqldf)
sqldf("select A.*, B.Int from A
left join B on A.ID = B.ID and A.date between B.date_1 and B.date_2")
giving:
ID date Interest Int
1 1 1900-01-01 NA 1
2 1 1900-11-01 NA 2
3 1 1902-01-01 NA 2
4 2 1903-01-01 NA 3
5 2 1905-01-01 NA NA
6 3 1900-01-01 NA NA
2) If you really want to use a loop, iterate over the rows of A and, for each row, take the first matching element of B:
Table <- A
for(i in 1:nrow(A)) {
ix <- which(A$ID[i] == B$ID & A$date[i] >= B$date_1 & A$date[i] <= B$date_2)[1]
Table$Int[i] <- B$Int[ix]
}
Table
giving:
ID date Interest Int
1 1 1900-01-01 NA 1
2 1 1900-11-01 NA 2
3 1 1902-01-01 NA 2
4 2 1903-01-01 NA 3
5 2 1905-01-01 NA NA
6 3 1900-01-01 NA NA
We can use fuzzyjoin
library(fuzzyjoin)
library(dplyr)
fuzzy_left_join(A, B, by = c('ID', 'date' = 'date_1', 'date' = 'date_2'),
match_fun = list(`==`, `>=`, `<=`)) %>%
transmute(ID = ID.x, date, Interest = Int)
# ID date Interest
#1 1 1900-01-01 1
#2 1 1900-11-01 2
#3 1 1902-01-01 2
#4 2 1903-01-01 3
#5 2 1905-01-01 NA
#6 3 1900-01-01 NA