简体   繁体   中英

Subset both column-wise and row-wise?

I have example data as below (real data is 150x200), and need to keep the best combination of rows and columns that would give me least NAs. I could use complete.cases() but it removes too many rows.

Just by looking at this example, it is obvious to exclude row x6 as it has most NA count. Similarly, we can exclude column A and F , as they have most NA count.

Need some hint on logic, doesn't have to be full code solution.

#reproducible data
df <- read.csv(text="
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

# complete cases
df[ complete.cases(df),]
#   SampleID A B C D E F
#5        x5 x x x x x x
#10      x10 x x x x x x

Additional info : This is a data for risk calculation, rows are samples and columns are variables. Each variable has a risk factor of some value. Risk predicting algorithm (computed using different custom software) can work with say, with 5 variables or with 200. The more variables will give obviously more reliable answer. To be able to have comparable results most samples should have most overlapping variables. I will need to keep at least ~60% of samples - rows.

Here's a function for you to play with. Hopefully it's something you can modify to suit your needs:

almostComplete <- function(dataset, rowPct, colPct = rowPct, n = 1) {
  if (sum(is.na(dataset)) == 0) out <- dataset
  else {
    CS <- colSums(is.na(dataset))/ncol(dataset)
    RS <- rowSums(is.na(dataset))/nrow(dataset)
    if (is.null(rowPct)) rowPct <- head(sort(RS, decreasing=TRUE), n)[n]
    if (is.null(colPct)) colPct <- head(sort(CS, decreasing=TRUE), n)[n]

    dropCols <- which(CS >= colPct)
    dropRows <- which(RS >= rowPct)
    out <- dataset[setdiff(sequence(nrow(dataset)), dropRows),
                   setdiff(sequence(ncol(dataset)), dropCols)]
  }
  out
}

A few examples...

Drop whichever row and column have the highest percentage of NA values:

almostComplete(df, NULL, NULL)
#    SampleID    B    C    D    E
# 1        x1    x <NA>    x <NA>
# 2        x2    x <NA>    x    x
# 3        x3 <NA>    x    x    x
# 4        x4    x    x <NA>    x
# 5        x5    x    x    x    x
# 7        x7    x    x <NA>    x
# 8        x8 <NA>    x    x    x
# 9        x9    x    x    x    x
# 10      x10    x    x    x    x
# 11      x11    x    x    x    x

Drop the rows and columns which have more than the second highest percentage of NA values.

almostComplete(df, NULL, NULL, n = 2)
#    SampleID    B    C    D E
# 2        x2    x <NA>    x x
# 4        x4    x    x <NA> x
# 5        x5    x    x    x x
# 7        x7    x    x <NA> x
# 8        x8 <NA>    x    x x
# 9        x9    x    x    x x
# 10      x10    x    x    x x
# 11      x11    x    x    x x

Set one threshold value for both rows and columns.

almostComplete(df, .7)
#    SampleID    B    C    D    E
# 1        x1    x <NA>    x <NA>
# 2        x2    x <NA>    x    x
# 3        x3 <NA>    x    x    x
# 4        x4    x    x <NA>    x
# 5        x5    x    x    x    x
# 6        x6 <NA> <NA>    x <NA>
# 7        x7    x    x <NA>    x
# 8        x8 <NA>    x    x    x
# 9        x9    x    x    x    x
# 10      x10    x    x    x    x
# 11      x11    x    x    x    x

Specify row and column threshold values separately.

almostComplete(df, .2, .5)
#    SampleID    B    C    D E
# 2        x2    x <NA>    x x
# 4        x4    x    x <NA> x
# 5        x5    x    x    x x
# 7        x7    x    x <NA> x
# 8        x8 <NA>    x    x x
# 9        x9    x    x    x x
# 10      x10    x    x    x x
# 11      x11    x    x    x x

Not knowing exactly how you're going to use this, I'm not sure if this is useful or not....

Maybe you could use apply with the function :

count.na <- function(vec) { return (length(which(is.na(vec)))) }

And finaly choose colums and rows with a percentage of NA

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM