Select subset of rows where a percentage of values meets a threshold

Question

I have a dataframe with values in rows and samples in columns (two groups, A and B). Example df:

df <- rbind(rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(1, times = 7), rep(2, times = 3)),
            rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(2, times = 7), rep(1, times = 3)))
colnames(df) <- c("A1", "A2", "A3", "A4", "A5",
              "B1", "B2", "B3", "B4", "B5")
row.names(df) <- 1:8

I have been selecting the subset of rows where all samples are below a certain threshold using the following:

selected <- apply(df, MARGIN = 1, function(x) all(x < 1.5))
df.sel <- df[selected,]

The result of this is:

df[c(1,5),]

I need two further types of selection. The first is to select, for example, all rows where at least 90% of the samples are below the threshold value of 1.5. The result of this should be:

df[c(1,2,5,6),]

The second is to select by group. Say, rows where at least 50% of the values in at least one of the groups are > 1.5. This should give me the following df:

df[c(4,8),]

I am new to Stack Overflow and I have been asked in the past to provide an example. I hope this is good!

Answer 1

df[!rowSums(df >= 1.5),]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 1  1  1  1  1  1  1  1  1  1  1
## 5  1  1  1  1  1  1  1  1  1  1

df[rowMeans(df < 1.5) >= 0.9,]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 1  1  1  1  1  1  1  1  1  1  1
## 2  1  1  1  1  1  1  1  1  1  2
## 5  1  1  1  1  1  1  1  1  1  1
## 6  1  1  1  1  1  1  1  1  1  2

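# Group label = column name with digits removed ("A1" -> "A");
# ">= 0.5" matches "at least 50%" in the question
idx <- apply(df, 1, function(x) {
    any(tapply(x, gsub("[0-9]", "", names(x)), function(y) mean(y > 1.5)) >= 0.5)
    })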

df[idx,]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 4  1  1  1  1  1  1  1  2  2  2
## 8  2  2  2  2  2  2  2  1  1  1
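
The same per-group test can also be written without the row-wise apply. The following is a minimal sketch, assuming (as above) that a column's group is its name with the digits stripped; `groups` and `prop_above` are illustrative names that do not appear in the original post, and `>= 0.5` is used to match "at least 50%".

```r
# Rebuild the example matrix from the question
df <- rbind(rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(1, 7), rep(2, 3)),
            rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(2, 7), rep(1, 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Assumption: group label = column name without digits ("A1" -> "A")
groups <- gsub("[0-9]", "", colnames(df))

# For each group, the share of values above 1.5 in every row
# (result: one row per matrix row, one column per group)
prop_above <- sapply(unique(groups), function(g) {
  rowMeans(df[, groups == g, drop = FALSE] > 1.5)
})

# Keep rows where at least one group reaches 50%
idx <- rowSums(prop_above >= 0.5) > 0
df[idx, ]
```

On the example data this keeps rows 4 and 8: row 4 has 60% of group B above 1.5, and row 8 has 100% of group A above 1.5.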

Answer 2

In your specific case of nearly all-ones, you can do all this with rowMeans or colMeans. (There is also plyr::colwise for more complicated stuff.)

Select the subset of rows where all samples are below a certain threshold using the following:

df[rowMeans(df < 1.5) == 1,]
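
One caveat worth checking: averaging the raw values and averaging the logical comparison are different tests. Here is a small sketch on the example matrix; the names `mean_below` and `all_below` are illustrative and not from the original post.

```r
# Rebuild the example matrix from the question
df <- rbind(rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(1, 7), rep(2, 3)),
            rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(2, 7), rep(1, 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Rows whose *mean value* is below 1.5:
# a row can pass this test while still containing 2s
mean_below <- rowMeans(df) < 1.5

# Rows where *every* value is below 1.5:
# the share of TRUEs in the comparison must be exactly 1
all_below <- rowMeans(df < 1.5) == 1

which(mean_below)  # rows 1 to 7
which(all_below)   # rows 1 and 5
```

Only the second form reproduces the `all(x < 1.5)` selection from the question.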

Select all rows where >= 90% of samples are below the threshold value of 1.5. (This is much easier if we exploit the fact that the only other value is 2.)

You can directly count the proportion of '1' entries with:

> apply(df, 1, function(x) sum(x==1)) /ncol(df)
  1   2   3   4   5   6   7   8 
1.0 0.9 0.8 0.7 1.0 0.9 0.8 0.3

Thus, to get the row indices you want:

> apply(df, 1, function(x) sum(x==1)) /ncol(df) >= 0.9
    1     2     3     4     5     6     7     8 
 TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

and the row slice you want:

> df[ apply(df, 1, function(x) sum(x==1)) /ncol(df) >= 0.9 , ]
  A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
1  1  1  1  1  1  1  1  1  1  1
2  1  1  1  1  1  1  1  1  1  2
5  1  1  1  1  1  1  1  1  1  1
6  1  1  1  1  1  1  1  1  1  2

The second is to select by group. Say, rows where at least 50% of the values in at least one of the groups are > 1.5.

Unless I misunderstand what you meant by 'at least one of the groups', your example is wrong: counting across all ten columns, row 4 doesn't qualify, only row 8.

Again, you could either cheat with rowSums, or else:

> apply(df, 1, function(x) sum(x>=1.5)) /ncol(df) >= 0.5
    1     2     3     4     5     6     7     8 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

That only gets you row 8, not row 4, so have I misunderstood you? (Jake Burhead clarifies that you are doing hierarchical indexing by the string name of the column. See his solution; there's no point in me reproducing it.)
