
Detect impossible data entry errors from repeated measures in data frames

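I have to check a large database that contains repeated measures of several variables on individuals. Since I can have more than 3 million observations, I would like to at least delete the records that I am sure are data entry errors.

Continuous variable

For example, take the variable weight (see the example data frame below). I know that an individual cannot lose more than 40% of its weight between one observation and the next. How can I detect observations with a larger weight loss, such as the third observation of individual "2", whose weight drops from 30 g to 3 g?

Categorical variable

For example, take the status of individuals. An individual can be classified into 3 states ("juvenile", "adult non-breeder" or "adult breeder"; 1, 2 and 3 respectively). I know that an individual cannot become a juvenile ("1") once it has been an adult ("2" or "3"), although a 3 -> 2 transition is possible. In this particular case I would like to detect observation 9, where individual "3" is classified as "juvenile" but was classified as "adult" in previous observations.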

Individuals <- c(1,1,1,2,2,2,3,3,3)
Weight <- c(10, 14, 20, 15, 30, 3, 12, 34, 30)
Week <- rep(1:3, 3)
Status <- c(1, 2, 3, 2, 3, 3, 2, 3, 1)
df <- as.data.frame (cbind(Individuals, Weight, Week, Status))
df

  Individuals Weight Week Status
1           1     10    1      1
2           1     14    2      2
3           1     20    3      3
4           2     15    1      2
5           2     30    2      3
6           2      3    3      3
7           3     12    1      2
8           3     34    2      3
9           3     30    3      1

Do you know how I could detect these two kinds of errors?

Based on your description, and only on the two "problems" you mention above, try this:

Individuals <- c(1,1,1,2,2,2,3,3,3)
Weight <- c(10, 14, 20, 15, 30, 3, 12, 34, 30)
Week <- rep(1:3, 3)
Status <- c(1, 2, 3, 2, 3, 3, 2, 3, 1)
df <- as.data.frame (cbind(Individuals, Weight, Week, Status))

library(dplyr)

df %>%
  group_by(Individuals) %>%      ## for each individual
  mutate(
    ## relative weight loss since the previous observation (negative numbers mean weight increase)
    WeightReduce = 1 - Weight / dplyr::lag(Weight, default = Weight[1]),
    ## flag a loss of 40% or more, or a juvenile (1) right after an adult status (2 or 3);
    ## computed before ungroup() so lag() never reads the previous individual's last row
    flag = ifelse(WeightReduce >= 0.4 |
                    (dplyr::lag(Status, default = Status[1]) %in% 2:3 & Status == 1), 1, 0)
  ) %>%
  ungroup()                      ## forget the grouping


#    Individuals Weight  Week Status WeightReduce  flag
#          (dbl)  (dbl) (dbl)  (dbl)        (dbl) (dbl)
# 1           1     10     1      1    0.0000000     0
# 2           1     14     2      2   -0.4000000     0
# 3           1     20     3      3   -0.4285714     0
# 4           2     15     1      2    0.0000000     0
# 5           2     30     2      3   -1.0000000     0
# 6           2      3     3      3    0.9000000     1
# 7           3     12     1      2    0.0000000     0
# 8           3     34     2      3   -1.8333333     0
# 9           3     30     3      1    0.1176471     1
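
The status rule in the question refers to any earlier adult observation, not only the one immediately before. A minimal sketch of that stricter check follows, assuming the rows can be ordered by `Week` within each individual. The column names `MaxPrevStatus` and `StatusError` are made up for illustration and do not come from the answer above.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  Individuals = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Weight      = c(10, 14, 20, 15, 30, 3, 12, 34, 30),
  Week        = rep(1:3, 3),
  Status      = c(1, 2, 3, 2, 3, 3, 2, 3, 1)
)

flagged <- df %>%
  arrange(Individuals, Week) %>%   # make sure observations are in time order
  group_by(Individuals) %>%
  mutate(
    # highest status seen in any earlier observation (0 for the first one)
    MaxPrevStatus = dplyr::lag(cummax(Status), default = 0),
    # juvenile now, but adult (2 or 3) at some earlier point
    StatusError   = Status == 1 & MaxPrevStatus >= 2
  ) %>%
  ungroup()

# row numbers of the suspicious observations
which(flagged$StatusError)
```

With a sequence such as 2, 1, 1 this marks both juvenile rows, whereas a comparison with only the previous row would mark just the first of them.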

You can use the data.table package to compute the rate of weight change and the juvenile anomaly, and then filter on these two criteria:

library(data.table)

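setDT(df)[, c('continuous', 'categorical') := list(
              c(0, diff(Weight) / head(Weight, -1)),       # rate of weight change per individual
              Status == 1 & c(FALSE, diff(Status) < 0)),   # TRUE when status drops back to juvenile (1)
          by = Individuals][
          continuous >= -0.4 & !categorical, ][]           # keep only rows that pass both checks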

#   Individuals Weight Week Status continuous categorical
#1:           1     10    1      1  0.0000000       FALSE
#2:           1     14    2      2  0.4000000       FALSE
#3:           1     20    3      3  0.4285714       FALSE
#4:           2     15    1      2  0.0000000       FALSE
#5:           2     30    2      3  1.0000000       FALSE
#6:           3     12    1      2  0.0000000       FALSE
#7:           3     34    2      3  1.8333333       FALSE
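
To see what the `continuous` column holds, the same expression can be applied by hand to the three weights of individual 2. This is a small base R sketch; the names `change` and `keep` are illustrative only.

```r
# Weights of individual 2 in weeks 1 to 3
Weight <- c(15, 30, 3)

# change relative to the previous weight; the first observation has no predecessor, so 0
change <- c(0, diff(Weight) / head(Weight, -1))
# values: 0, (30 - 15) / 15 = 1, (3 - 30) / 30 = -0.9

# an observation is kept when the loss is at most 40%
keep <- change >= -0.4
```

The third observation has a change of -0.9, a 90% loss, so it fails the `>= -0.4` test and is dropped by the filter.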

I hope this helps.

library(data.table)
library(zoo)
df <- data.table(df)

# percentage change in Weight relative to the previous observation (0 for the first one);
# this will make it easy to get rid of values where WeightReduction < -.4
calcreduction <- function(x){
  res <- diff(x)/x[-length(x)]
  return(c(0,res))
}

# assign a combination type to each pair of consecutive Status values ("00" for the first one);
# you can have 11,12,13,22,23,32,33 or 21,31. The latter are "bad";
# this will make it easy to get rid of values where the Status change is no good
getcomb <- function(x){
  res <- rbind(c(0,0),rollapply(x,2,paste))
  return(paste(res[,1],res[,2],sep=""))
}

# you can just pull the new vectors and then use logic
# to decide what you want to do with these values
res <- df[,list("WeightReduction"=calcreduction(Weight),
                "StatusChange"=getcomb(Status),Weight,Week,Status),by=Individuals]

> res
   Individuals WeightReduction StatusChange Weight Week Status
1:           1       0.0000000           00     10    1      1
2:           1       0.4000000           12     14    2      2
3:           1       0.4285714           23     20    3      3
4:           2       0.0000000           00     15    1      2
5:           2       1.0000000           23     30    2      3
6:           2      -0.9000000           33      3    3      3
7:           3       0.0000000           00     12    1      2
8:           3       1.8333333           23     34    2      3
9:           3      -0.1176471           31     30    3      1
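
One possible way to apply the "use logic" step to these two columns is sketched below. It rebuilds the relevant part of `res` from the printed values so that it runs on its own; the names `bad_status`, `Suspect` and `clean` are hypothetical and not part of the answer.

```r
library(data.table)

# Values copied from the printed `res` above
res <- data.table(
  Individuals     = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  WeightReduction = c(0, 0.4, 0.4285714, 0, 1, -0.9, 0, 1.8333333, -0.1176471),
  StatusChange    = c("00", "12", "23", "00", "23", "33", "00", "23", "31")
)

# transitions from an adult status back to juvenile
bad_status <- c("21", "31")

# mark rows with a weight loss above 40% or an impossible status transition
res[, Suspect := WeightReduction < -0.4 | StatusChange %in% bad_status]

# keep only the rows that pass both checks
clean <- res[Suspect == FALSE]
```

On the example data this marks rows 6 and 9 and leaves 7 rows in `clean`. Keeping the `Suspect` column rather than deleting rows right away makes it possible to review the flagged records first.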

