Counting the frequency of incorrect values in R

Here is my example dataset:

 set.seed(123)
 myd <- data.frame (sub = paste ("S", 1:10, sep = ""), P1 = sample(c(1,-1,2,0), 10, replace = TRUE),
                    P2 = sample(c(1,-1,2,0), 10, replace = TRUE),
                    I1 = sample(c(1,-1,2,0), 10, replace = TRUE),
                    I2 = sample(c(1,-1,2,0), 10, replace = TRUE),
                    I3 = sample(c(1,-1,2,0), 10, replace = TRUE),
                    I4 = sample(c(1,-1,2,0), 10, replace = TRUE),
                    I5 = sample(c(1,-1,2,0), 10, replace = TRUE),
                    I6 = sample(c(1,-1,2,0), 10, replace = TRUE)
                    )
 myd 

  sub P1 P2 I1 I2 I3 I4 I5 I6
1   S1 -1  0  0  0  1  1  2  0
2   S2  0 -1  2  0 -1 -1  1  2
3   S3 -1  2  2  2 -1  0 -1  2
4   S4  0  2  0  0 -1  1 -1  1
5   S5  0  1  2  1  1  2  0 -1
6   S6  1  0  2 -1  1  1 -1  1
7   S7  2  1  2  0  1  1  0 -1
8   S8  0  1  2  1 -1  0  0  2
9   S9  2 -1 -1 -1 -1  0  0 -1
10 S10 -1  0  1  1  0 -1 -1  1
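
A note on reproducing this data: the default algorithm behind sample() changed in R 3.6.0, so set.seed(123) on a current R release gives different numbers from those printed above. The sketch below assumes R 3.6.0 or later and asks for the older "Rounding" sampler, which should reproduce the P1 column shown. The same approach would apply to the remaining columns if they are drawn in the same order.

```r
# Request the pre-3.6.0 "Rounding" sampler. R warns that this sampler is
# non-uniform, so the warning is suppressed here. Note that this changes the
# sampling kind for the rest of the session.
suppressWarnings(set.seed(123, sample.kind = "Rounding"))

# Same call as used for column P1 in the example dataset.
P1 <- sample(c(1, -1, 2, 0), 10, replace = TRUE)
P1
```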

Translation table for the incorrect values, conditioned on the values of P1 and P2 (-1 denotes a missing value):

  Condition   P1    P2    Incorrect value
  I            1     1    None
  II           1     0    2
  III          0     1    2
  IV           2     0    2 or 0
  V            0     2    2 or 0
  VI           2     2    1 or 0
  VII          1     2    0
  VIII         2     1    0

  # if either P1 or P2 is -1, all values become NA
  IX          -1     0    NA
  X            0    -1    NA
  XI          -1    -1    NA
  XII         -1     2    NA
  XIII         2    -1    NA
  XIV         -1     1    NA
  XV           1    -1    NA

The following is short code for the translation table in data.frame format, except for conditions IV, V and VI, where I did not know how to enter the incorrect values because there are two of them:

 ttable <- data.frame (P1 = c(1,1,0,2,0,2,1,2,-1, 0,-1,-1,2,-1,1), 
                      P2 = c(1,0,1,0,2,2,2,1,0,-1,-1,2,-1,1,-1),  # last pair is condition XV (1, -1)
                    errort = c("None", 2,2,2, 2,1,0,0,NA, NA, NA, NA, NA, NA,NA))
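
One possible way to enter conditions that have two incorrect values is a long-format table with one row per (P1, P2, incorrect value) combination, so condition IV simply takes two rows. This is only a sketch: the names bad_values and is_incorrect are made up for illustration, and missing values (-1) are assumed to be handled separately.

```r
# Hypothetical long-format lookup: one row per incorrect value.
# Condition I (1, 1) has no incorrect value, so it contributes no rows.
bad_values <- data.frame(
  P1  = c(1, 0, 2, 2, 0, 0, 2, 2, 1, 2),
  P2  = c(0, 1, 0, 0, 2, 2, 2, 2, 2, 1),
  bad = c(2, 2, 2, 0, 2, 0, 1, 0, 0, 0)
)

# TRUE where the combination (p1, p2, i) appears in the lookup table.
# Pasting the three values together gives a simple key to match on.
is_incorrect <- function(p1, p2, i) {
  paste(p1, p2, i) %in% paste(bad_values$P1, bad_values$P2, bad_values$bad)
}

# Row S4 has P1 = 0 and P2 = 2 (condition V): 0 and 2 are incorrect, 1 is not.
is_incorrect(0, 2, c(0, 2, 1))
```

The trade-off of this layout is a longer table, but every cell holds a single number, which keeps the column numeric and makes matching straightforward.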

What I am trying to do: for each row S1 to S10, I would like to look at the values in the P1 and P2 columns, find the matching condition, and check the values in columns I1 to I6 against it:

   sub   P1 P2 I1 I2 I3 I4 I5 I6
1   S1   -1  0  0  0  1  1  2  0

In this case one of P1 and P2 is -1, so all values will be NA.

Another case:

          sub   P1 P2  I1  I2  I3  I4   I5  I6
           S4   0  2   0   0  -1   1   -1   1

Here P1 = 0, P2 = 2, so the following values I1 = Incorrect, I2 = Incorrect, I3 = NA, I4 = correct, I5 = NA, I6 = correct

May be written as

sub   P1 P2  I1      I2      I3   I4     I5   I6
 S4   0  2   0       0       -1   1      -1   1

             FALSE,  FALSE,  NA,  TRUE,  NA,  TRUE

This matches condition (V): 0 and 2 are incorrect, while 1 is correct and -1 is missing.

Another case: here P1 = 0 and P2 = 1, which matches condition (III) in the translation table, so the incorrect value is 2.

 5   S5  0  1   2      1     1     2      0      -1
                FALSE, TRUE, TRUE, FALSE, TRUE,  NA

I need to calculate the frequency of FALSE values. I tried a lot of if-else statements, but they do not give the desired output. The code gets messy with so many of them, and I do not think this is efficient for the large dataset I will be using.

qcfun <- function (x) {
  x <- x[3:length(x)]
  obs1 <- table(c(x, 2, 0, 1, -1))
  obs <- obs1 - 1
  ov <- NULL
  if (x[1] == 1 & x[2] == 0) {
    ov <- round(as.numeric(obs[4] / sum(obs)), 2)
  } else if (x[1] == 0 & x[2] == 1) {
    ov <- round(as.numeric(obs[4] / sum(obs)), 2)
  } else if (x[1] == 1 & x[2] == 2) {
    ov <- round(as.numeric(obs[2] / sum(obs)), 2)
  } else if (x[1] == 2 & x[2] == 1) {
    ov <- round(as.numeric(obs[2] / sum(obs)), 2)
  } else if (x[1] == 1 & x[2] == 1) {
    ov <- 0
  } else {
    ov <- NA
  }
  return(ov)
}
out1 <- apply(myd, 1, qcfun)
table(out1)
tout1 <- table(out1)

Is there a quick / efficient way of doing this?

You can use this vectorized function, which will be efficient for a large number of rows:

fixI <- function(p1, p2, i){
    negative <- (p1 < 0) | (p2 < 0) | (i < 0)
    result <- ifelse(negative, NA, TRUE)  # conditions IX to XV

    p <- p1 * 10 + p2

    result[!negative & p %in% c(10,1,20,2) & i==2] <- FALSE
    result[!negative & p %in% c(20,2,22,12,21) & i==0] <- FALSE
    result[!negative & p==22 & i==1] <- FALSE

    result
}
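
To see what the function does for a single subject, the sketch below repeats the definition and applies it to row S4 (P1 = 0, P2 = 2). The key p1 * 10 + p2 packs the pair into one number, so (0, 2) becomes 2 and (2, 1) becomes 21. This works here because P1 and P2 only take single-digit values once the negative cases are excluded. The variable name s4 is just for illustration.

```r
# Same function as above, repeated so this example runs on its own.
fixI <- function(p1, p2, i){
    negative <- (p1 < 0) | (p2 < 0) | (i < 0)
    result <- ifelse(negative, NA, TRUE)  # NA wherever a -1 is involved

    p <- p1 * 10 + p2                     # (P1, P2) packed into one key

    result[!negative & p %in% c(10,1,20,2) & i==2] <- FALSE
    result[!negative & p %in% c(20,2,22,12,21) & i==0] <- FALSE
    result[!negative & p==22 & i==1] <- FALSE

    result
}

# Row S4: I1..I6 are 0, 0, -1, 1, -1, 1 (condition V, so 0 and 2 are incorrect).
s4 <- fixI(p1 = 0, p2 = 2, i = c(0, 0, -1, 1, -1, 1))
s4
# FALSE FALSE NA TRUE NA TRUE
```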

Apply it to the I columns in myd:

mat <- sapply(myd[,paste0("I",1:6)], fixI, p1=myd$P1, p2=myd$P2)

rownames(mat) <- myd$sub

Result:

       I1    I2   I3    I4    I5    I6
S1     NA    NA   NA    NA    NA    NA
S2     NA    NA   NA    NA    NA    NA
S3     NA    NA   NA    NA    NA    NA
S4  FALSE FALSE   NA  TRUE    NA  TRUE
S5  FALSE  TRUE TRUE FALSE  TRUE    NA
S6  FALSE    NA TRUE  TRUE    NA  TRUE
S7   TRUE FALSE TRUE  TRUE FALSE    NA
S8  FALSE  TRUE   NA  TRUE  TRUE FALSE
S9     NA    NA   NA    NA    NA    NA
S10    NA    NA   NA    NA    NA    NA

Now you can count the FALSE values like this:

By row:

apply(!mat, 1, sum, na.rm=TRUE)

 S1  S2  S3  S4  S5  S6  S7  S8  S9 S10 
  0   0   0   2   2   1   2   2   0   0 

By column:

apply(!mat, 2, sum, na.rm=TRUE)

 I1 I2 I3 I4 I5 I6 
  4  2  0  1  1  1 
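
The same counts can also be obtained with rowSums() and colSums(). Since the attempt in the question computed a rounded proportion per row, rowMeans() may be closer to what is wanted. The sketch below uses a small hand-typed matrix with rows S1, S4 and S5 copied from the result above. The names n_false and prop_false are made up for illustration.

```r
# Three rows copied from the result matrix shown earlier.
mat <- rbind(
  S1 = c(NA, NA, NA, NA, NA, NA),
  S4 = c(FALSE, FALSE, NA, TRUE, NA, TRUE),
  S5 = c(FALSE, TRUE, TRUE, FALSE, TRUE, NA)
)
colnames(mat) <- paste0("I", 1:6)

# Number of FALSE values per row, ignoring NA.
n_false <- rowSums(!mat, na.rm = TRUE)

# Proportion of FALSE among the non-missing values per row.
# A row that is entirely NA yields NaN rather than 0.
prop_false <- round(rowMeans(!mat, na.rm = TRUE), 2)

n_false
prop_false
```

Note the difference for S1: the count is 0 even though nothing could be checked, while the proportion is NaN, which keeps unverifiable rows distinguishable from rows with no errors.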
