Detect impossible data entry errors from repeated measures in data frames
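I have to check a large database containing repeated measures of several variables on individuals. Since it can hold more than 3 million observations, I would like to at least remove the records that I am certain are data entry errors.
Continuous variables
Take the variable weight, for example (see the data frame below). I know that an individual cannot lose more than 40% of its weight between one observation and the next. How can I detect observations with a larger weight loss, such as the third observation of individual "2", whose weight drops from 30 g to 3 g?
Categorical variables
Take the status of an individual, for example. An individual can be classified into 3 states ("juvenile", "adult non-breeder" or "adult breeder"; coded 1, 2 and 3 respectively). I know that an individual cannot become a juvenile ("1") once it has been an adult ("2" or "3"), although a transition 3 -> 2 is possible. In this particular case, I would like to detect observation 9, where individual "3" is classified as "juvenile" but was classified as "adult" in the previous observation.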
Individuals <- c(1,1,1,2,2,2,3,3,3)
Weight <- c(10, 14, 20, 15, 30, 3, 12, 34, 30)
Week <- rep(1:3, 3)
Status <- c(1, 2, 3, 2, 3, 3, 2, 3, 1)
df <- as.data.frame (cbind(Individuals, Weight, Week, Status))
df
  Individuals Weight Week Status
1           1     10    1      1
2           1     14    2      2
3           1     20    3      3
4           2     15    1      2
5           2     30    2      3
6           2      3    3      3
7           3     12    1      2
8           3     34    2      3
9           3     30    3      1
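Do you know how I could catch these two kinds of errors?
Based on your description, and covering only the two "problems" you mention above, try this: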
Individuals <- c(1,1,1,2,2,2,3,3,3)
Weight <- c(10, 14, 20, 15, 30, 3, 12, 34, 30)
Week <- rep(1:3, 3)
Status <- c(1, 2, 3, 2, 3, 3, 2, 3, 1)
df <- as.data.frame (cbind(Individuals, Weight, Week, Status))
library(dplyr)
df %>%
  group_by(Individuals) %>% ## for each individual
  mutate(WeightReduce = 1 - Weight / dplyr::lag(Weight, default = Weight[1]), ## fractional weight loss (negative numbers here mean weight increase)
         flag = ifelse(WeightReduce >= 0.4 | (dplyr::lag(Status, default = Status[1]) %in% 2:3 & Status == 1), 1, 0)) %>% ## flag errors while still grouped, so lag() never compares across individuals
  ungroup() ## forget the grouping
#   Individuals Weight  Week Status WeightReduce  flag
#         (dbl)  (dbl) (dbl)  (dbl)        (dbl) (dbl)
# 1           1     10     1      1    0.0000000     0
# 2           1     14     2      2   -0.4000000     0
# 3           1     20     3      3   -0.4285714     0
# 4           2     15     1      2    0.0000000     0
# 5           2     30     2      3   -1.0000000     0
# 6           2      3     3      3    0.9000000     1
# 7           3     12     1      2    0.0000000     0
# 8           3     34     2      3   -1.8333333     0
# 9           3     30     3      1    0.1176471     1
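For readers who prefer to avoid extra packages, the same two checks can be sketched in base R. This is only an illustration under stated assumptions: the helper `prev_within()` is a hypothetical name, rows are sorted by individual and week before comparing, and the threshold is read strictly as "more than 40%" (`> 0.4`), which differs slightly from the `>= 0.4` used above.

```r
# Same example data as in the question
df <- data.frame(
  Individuals = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Weight      = c(10, 14, 20, 15, 30, 3, 12, 34, 30),
  Week        = rep(1:3, 3),
  Status      = c(1, 2, 3, 2, 3, 3, 2, 3, 1)
)

# Hypothetical helper: previous value within each group g
# (the first observation of a group is compared with itself)
prev_within <- function(x, g) {
  ave(x, g, FUN = function(v) c(v[1], head(v, -1)))
}

# Make sure observations are in chronological order within each individual
df <- df[order(df$Individuals, df$Week), ]

prev_weight <- prev_within(df$Weight, df$Individuals)
prev_status <- prev_within(df$Status, df$Individuals)

# Fractional weight loss relative to the previous observation
weight_loss <- 1 - df$Weight / prev_weight

# Flag a loss of more than 40%, or an adult (2 or 3) followed by a juvenile (1)
df$flag <- weight_loss > 0.4 | (prev_status %in% 2:3 & df$Status == 1)

# Keep only the rows that were not flagged
clean <- df[!df$flag, ]
```

With the example data, `which(df$flag)` points at rows 6 and 9, the two observations singled out in the question.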
You can use the data.table package to compute the rate of weight change and the juvenile anomaly, and then filter on both criteria:
library(data.table)
setDT(df)[, c('continuous', 'categorical') := list(
    c(0, diff(Weight) / head(Weight, -1)),     # rate of weight change per individual
    Status == 1 & c(FALSE, diff(Status) < 0)   # status 1 reached from a higher status
  ), by = Individuals][
  continuous >= -0.4 & !categorical][]
#   Individuals Weight Week Status continuous categorical
#1:           1     10    1      1  0.0000000       FALSE
#2:           1     14    2      2  0.4000000       FALSE
#3:           1     20    3      3  0.4285714       FALSE
#4:           2     15    1      2  0.0000000       FALSE
#5:           2     30    2      3  1.0000000       FALSE
#6:           3     12    1      2  0.0000000       FALSE
#7:           3     34    2      3  1.8333333       FALSE
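To make the two expressions inside `list(...)` easier to follow, here is a small base R sketch that evaluates them by hand for one individual at a time. The vector names `w`, `s`, `rate` and `juvenile_after_adult` are illustrative only; the values come from individuals 2 and 3 in the example data.

```r
# Weights of individual 2 over the three weeks
w <- c(15, 30, 3)

# Change relative to the previous weight; the first observation has no
# predecessor, so it is given a change of 0
rate <- c(0, diff(w) / head(w, -1))
rate            # 0.0, 1.0, -0.9
rate >= -0.4    # TRUE, TRUE, FALSE: the drop from 30 to 3 fails the check

# Status codes of individual 3 over the three weeks
s <- c(2, 3, 1)

# TRUE where the status is 1 and lower than the previous status
juvenile_after_adult <- s == 1 & c(FALSE, diff(s) < 0)
juvenile_after_adult    # FALSE, FALSE, TRUE
```

Because status 1 is the lowest code, any decrease that ends at 1 must have started from 2 or 3, which is why `diff(s) < 0` is enough here.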
I hope this helps.
library(data.table)
library(zoo)
df <- data.table(df)
# used to check percentage change in weight variable
calcreduction <- function(x) {
  res <- diff(x) / x[-length(x)]
  return(c(0, res))
}
# this will make it easy to get rid of values where WeightReduction < -.4
# function used to assign combination type
# you can have 11,12,13,22,23,32,33 or 21,31. The latter are "bad"
getcomb <- function(x) {
  res <- rbind(c(0, 0), rollapply(x, 2, paste))
  return(paste(res[, 1], res[, 2], sep = ""))
}
# this will make it easy to get rid of values where the Status change is no good
# you can just pull the new vectors and then use logic
# to decide what you want to do with these values
res <- df[, list("WeightReduction" = calcreduction(Weight),
                 "StatusChange" = getcomb(Status), Weight, Week, Status),
          by = Individuals]
> res
   Individuals WeightReduction StatusChange Weight Week Status
1:           1       0.0000000           00     10    1      1
2:           1       0.4000000           12     14    2      2
3:           1       0.4285714           23     20    3      3
4:           2       0.0000000           00     15    1      2
5:           2       1.0000000           23     30    2      3
6:           2      -0.9000000           33      3    3      3
7:           3       0.0000000           00     12    1      2
8:           3       1.8333333           23     34    2      3
9:           3      -0.1176471           31     30    3      1
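The "use logic" step mentioned in the comments could look roughly like the sketch below. It is only an assumption about how one might finish the job: the table is re-entered by hand from the printed output so the snippet runs without data.table or zoo, and the names `bad` and `kept` are illustrative.

```r
# The result table from above, re-entered by hand as a plain data frame
res <- data.frame(
  Individuals     = rep(1:3, each = 3),
  WeightReduction = c(0, 0.4, 0.4285714, 0, 1, -0.9, 0, 1.8333333, -0.1176471),
  StatusChange    = c("00", "12", "23", "00", "23", "33", "00", "23", "31"),
  Weight          = c(10, 14, 20, 15, 30, 3, 12, 34, 30),
  Week            = rep(1:3, 3),
  Status          = c(1, 2, 3, 2, 3, 3, 2, 3, 1),
  stringsAsFactors = FALSE
)

# A row is suspicious if the weight fell by more than 40%
# or the status went from adult (2 or 3) back to juvenile (1)
bad <- res$WeightReduction < -0.4 | res$StatusChange %in% c("21", "31")

# Inspect the suspicious rows before dropping them
res[bad, ]

# Keep everything else
kept <- res[!bad, ]
```

Looking at `res[bad, ]` first is a cheap sanity check before deleting anything from a table with millions of rows.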