
Selecting a subset of rows where a % of the values meet the threshold

I have a dataframe with values in rows and samples in columns (two groups, A and B). Example df:

df <- rbind(rep(1, times = 10),
        c(rep(1, times = 9), 2),
        c(rep(1, times = 8), rep(2, times = 2)),
        c(rep(1, times = 7), rep(2, times = 3)),
        rep(1, times = 10),
        c(rep(1, times = 9), 2),
        c(rep(1, times = 8), rep(2, times = 2)),
        c(rep(2, times = 7), rep(1, times = 3)))
colnames(df) <- c("A1", "A2", "A3", "A4", "A5",
              "B1", "B2", "B3", "B4", "B5")
row.names(df) <- 1:8

I have been selecting the subset of rows where all samples are below a certain threshold using the following:

selected <- apply(df, MARGIN = 1, function(x) all(x < 1.5))
df.sel <- df[selected,]

The result of this is:

df[c(1,5),]

I require two further types of selection. The first is to select, for example, all rows where at least 90% of the samples are below the threshold value of 1.5. The result of this should be:

df[c(1,2,5,6),]

The second is to select by group. Say, rows where at least 50% of the values in at least one of the groups are greater than 1.5. This should give me the following df:

df[c(4,8),]

I am new to Stack Overflow and I have been asked in the past to provide an example. I hope this is good!

df[!rowSums(df >= 1.5),]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 1  1  1  1  1  1  1  1  1  1  1
## 5  1  1  1  1  1  1  1  1  1  1

df[rowMeans(df < 1.5) >= 0.9,]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 1  1  1  1  1  1  1  1  1  1  1
## 2  1  1  1  1  1  1  1  1  1  2
## 5  1  1  1  1  1  1  1  1  1  1
## 6  1  1  1  1  1  1  1  1  1  2

idx <- apply(df, 1, function(x) {
    # group = column name without digits; "at least 50%" means >= 0.5
    any(tapply(x, gsub("[0-9]", "", names(x)), function(y) mean(y > 1.5)) >= 0.5)
    })

df[idx,]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 4  1  1  1  1  1  1  1  2  2  2
## 8  2  2  2  2  2  2  2  1  1  1
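
The group-wise selection can also be written without a per-row `apply`. The following is a minimal sketch, assuming (as the code above does) that the group label is the column name with the digits stripped. The names `groups`, `prop_above`, and `keep` are illustrative and do not come from the original answer.

```r
# Rebuild the example matrix from the question
df <- rbind(rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(1, 7), rep(2, 3)),
            rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(2, 7), rep(1, 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Group label = column name without digits
# (assumption: names look like "A1", "B3")
groups <- gsub("[0-9]", "", colnames(df))

# One column per group: proportion of that group's samples
# above 1.5 in each row
prop_above <- sapply(unique(groups), function(g) {
  rowMeans(df[, groups == g, drop = FALSE] > 1.5)
})

# Keep rows where at least one group reaches 50%
keep <- rowSums(prop_above >= 0.5) > 0
df[keep, ]
```

Here `prop_above` is an 8 x 2 matrix with one column per group, so the per-group proportions behind each decision can be inspected directly.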

In your specific case of nearly all-ones, you can do all of this with rowMeans or colMeans. (There is also plyr::colwise for more complicated cases.)

Select the subset of rows where all samples are below a certain threshold. Compare each value with the threshold first, so that rowMeans gives the proportion of samples below it:

df[rowMeans(df < 1.5) == 1,]

Select all rows where >= 90% of samples are below the threshold value of 1.5. (This is much easier if we exploit the fact that the only other value is 2.)

You can directly count the proportion of '1' entries with:

> apply(df, 1, function(x) sum(x==1)) /ncol(df)
  1   2   3   4   5   6   7   8 
1.0 0.9 0.8 0.7 1.0 0.9 0.8 0.3

Thus to get the row-indices you want:

> apply(df, 1, function(x) sum(x==1)) /ncol(df) >= 0.9
    1     2     3     4     5     6     7     8 
 TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE 

and the row-slice you want:

> df[ apply(df, 1, function(x) sum(x==1)) /ncol(df) >= 0.9 , ]
  A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
1  1  1  1  1  1  1  1  1  1  1
2  1  1  1  1  1  1  1  1  1  2
5  1  1  1  1  1  1  1  1  1  1
6  1  1  1  1  1  1  1  1  1  2
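
The all-below selection and the 90% selection are the same operation with a different required fraction, and neither needs to rely on the values being only 1 and 2. A minimal sketch follows; `select_rows` and its parameters `threshold` and `min_frac` are hypothetical names introduced here for illustration, not part of any package.

```r
# Rebuild the example matrix from the question
df <- rbind(rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(1, 7), rep(2, 3)),
            rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(2, 7), rep(1, 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Hypothetical helper: keep rows in which at least `min_frac`
# of the values are below `threshold`.
select_rows <- function(m, threshold, min_frac) {
  # m < threshold is a logical matrix; its row means are the
  # per-row proportions of values below the threshold
  m[rowMeans(m < threshold) >= min_frac, , drop = FALSE]
}

select_rows(df, 1.5, 1)    # every sample below 1.5
select_rows(df, 1.5, 0.9)  # at least 90% of samples below 1.5
```

With this data the first call should return rows 1 and 5, and the second rows 1, 2, 5, and 6.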

The second is to select by group. Say, rows where at least 50% of the values in at least one of the groups are greater than 1.5.

Unless I misunderstand what you meant by 'at least one of the groups', your example is wrong: row 4 doesn't qualify, only row 8.

Again, you could either cheat with rowSums, or else:

> apply(df, 1, function(x) sum(x>=1.5)) /ncol(df) >= 0.5
    1     2     3     4     5     6     7     8
FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

That only gets you row 8, not row 4, so have I misunderstood you? (Jake Burhead clarifies that you are grouping the columns by the letter in their names. See his solution; there's no point in me reproducing it.)
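
The discrepancy for row 4 can be checked by computing the proportion within each group instead of over the whole row. This is a small sketch, assuming the group is the letter prefix of the column name; `grp`, `p4`, and `p8` are illustrative names.

```r
# Rebuild the example matrix from the question
df <- rbind(rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(1, 7), rep(2, 3)),
            rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(2, 7), rep(1, 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Group label for each column: "A" or "B"
grp <- gsub("[0-9]", "", colnames(df))

# Proportion of values above 1.5 within each group, for rows 4 and 8
p4 <- tapply(df["4", ] > 1.5, grp, mean)  # A: 0, B: 0.6
p8 <- tapply(df["8", ] > 1.5, grp, mean)  # A: 1, B: 0.4
p4
p8
```

Over the whole row, row 4 has only 30% of its values above 1.5, but within group B the share is 60%, which is why it qualifies under the per-group reading.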
