Selecting a subset of rows where a % of the values meet the threshold

I have a dataframe with values in rows and samples in columns (two groups, A and B). Example df:

df <- rbind(rep(1, times = 10), 
        c(rep(1, times = 9), 2), 
        c(rep(1, times = 8), rep(2, times = 2)),
        c(rep(1, times = 7), rep(2, times = 3)), 
        rep(1, times = 10), 
        c(rep(1, times = 9), 2), 
        c(rep(1, times = 8), rep(2, times = 2)), 
        c(rep(2, times = 7), rep(1, times = 3)))
colnames(df) <- c("A1", "A2", "A3", "A4", "A5",
              "B1", "B2", "B3", "B4", "B5")
row.names(df) <- 1:8

I have been selecting the subset of rows where all samples are below a certain threshold using the following:

selected <- apply(df, MARGIN = 1, function(x) all(x < 1.5))
df.sel <- df[selected,]

The result of this is:

df[c(1,5),]

I require two further types of selection. The first is to select, for example, all rows where at least 90% of the samples are below the threshold value of 1.5. The result of this should be:

df[c(1,2,5,6),]

The second is to select by group. Say, rows where at least 50% of the values in at least one of the groups are > 1.5. This should give me the following df:

df[c(4,8),]

I am new to Stack Overflow and I have been asked in the past to provide an example. I hope this is good!

# Rows where no value is at or above the threshold
df[!rowSums(df >= 1.5),]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 1  1  1  1  1  1  1  1  1  1  1
## 5  1  1  1  1  1  1  1  1  1  1

# Rows where at least 90% of the values are below the threshold
df[rowMeans(df < 1.5) >= 0.9,]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 1  1  1  1  1  1  1  1  1  1  1
## 2  1  1  1  1  1  1  1  1  1  2
## 5  1  1  1  1  1  1  1  1  1  1
## 6  1  1  1  1  1  1  1  1  1  2
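
Both selections above follow one pattern: count the values below the threshold in each row and compare that count with the required share of columns. A minimal sketch of that pattern, assuming a hypothetical helper named rows_below (it is not part of the original answer) and the example df from the question:

```r
# Rebuild the example matrix from the question.
df <- rbind(rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(1, 7), rep(2, 3)),
            rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(2, 7), rep(1, 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# rows_below() is a hypothetical helper: it returns a logical vector marking
# the rows in which at least `frac` of the values are below `threshold`.
rows_below <- function(m, threshold, frac) {
  rowSums(m < threshold) >= frac * ncol(m)
}

df[rows_below(df, 1.5, 1.0), ]  # every sample below the threshold
df[rows_below(df, 1.5, 0.9), ]  # at least 90% of the samples below it
```

With frac = 1 this should reproduce the all-samples selection, and with frac = 0.9 the 90% selection.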

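# Rows where, in at least one group (A or B), at least 50% of the values exceed 1.5.
# The group label is the column name with its digits removed.
idx <- apply(df, 1, function(x) {
    any(tapply(x, gsub("[0-9]", "", names(x)), function(y) mean(y > 1.5)) >= 0.5)
    })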

df[idx,]
##   A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 4  1  1  1  1  1  1  1  2  2  2
## 8  2  2  2  2  2  2  2  1  1  1
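
The group-wise selection can also be written without a per-row apply, by computing one rowMeans per group of columns. This is only a sketch under the assumption that the group is the column-name prefix; the names groups, frac_above and keep are illustrative:

```r
# Rebuild the example matrix from the question.
df <- rbind(rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(1, 7), rep(2, 3)),
            rep(1, 10),
            c(rep(1, 9), 2),
            c(rep(1, 8), rep(2, 2)),
            c(rep(2, 7), rep(1, 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Derive each column's group label by stripping the trailing digits.
groups <- sub("[0-9]+$", "", colnames(df))

# For each group, compute the per-row fraction of values above 1.5.
# The result has one row per row of df and one column per group.
frac_above <- sapply(split(seq_len(ncol(df)), groups), function(cols) {
  rowMeans(df[, cols, drop = FALSE] > 1.5)
})

# Keep rows where at least one group reaches 50%.
keep <- rowSums(frac_above >= 0.5) > 0
df[keep, ]
```

On the example data this should keep rows 4 and 8, matching the result requested in the question.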

In your specific case of nearly all ones, you can do all this with rowMeans or colMeans. (There is also plyr::colwise for more complicated stuff.)

Select the subset of rows where all samples are below a certain threshold using the following:

df[rowMeans(df < 1.5) == 1,]

Select all rows where >=90% of samples are below the threshold value of 1.5. (This is much easier if we exploit knowing that the only other value is 2.)

You can directly count the proportion of '1' entries with:

> apply(df, 1, function(x) sum(x==1)) /ncol(df)
  1   2   3   4   5   6   7   8 
1.0 0.9 0.8 0.7 1.0 0.9 0.8 0.3

Thus, to get the row indices you want:

> apply(df, 1, function(x) sum(x==1)) /ncol(df) >= 0.9
    1     2     3     4     5     6     7     8 
 TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE 

and the row slice you want:

> df[ apply(df, 1, function(x) sum(x==1)) /ncol(df) >= 0.9 , ]
  A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
1  1  1  1  1  1  1  1  1  1  1
2  1  1  1  1  1  1  1  1  1  2
5  1  1  1  1  1  1  1  1  1  1
6  1  1  1  1  1  1  1  1  1  2

The second is to select by group. Say, rows where at least 50% of the values in at least one of the groups are > 1.5.

Unless I misunderstand what you meant by 'at least one of the groups', your example's wrong. Row 4 doesn't qualify, only row 8.

Again, you could either cheat with rowSums, or else:

> apply(df, 1, function(x) sum(x>=1.5)) /ncol(df) >= 0.5
    1     2     3     4     5     6     7     8 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE 

That only gets you row 8, not 4, so have I misunderstood you? (Jake Burhead clarifies you are doing hierarchical indexing by string name of column. See his solution, there's no point in me reproducing it.)
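
The difference comes down to whether the 50% is measured over the whole row or within each group. A small sketch on row 4 of the example data, with illustrative variable names that do not come from the original posts:

```r
# Row 4 of the example: group A is all 1s, group B has two 1s and three 2s.
row4 <- c(rep(1, 7), rep(2, 3))
names(row4) <- c(paste0("A", 1:5), paste0("B", 1:5))

# Whole-row fraction of values at or above 1.5: 3 of 10.
whole_row <- mean(row4 >= 1.5)

# Per-group fractions; the group label is the column-name prefix.
group <- sub("[0-9]+$", "", names(row4))
per_group <- tapply(row4 >= 1.5, group, mean)

whole_row >= 0.5        # FALSE: the row fails the whole-row test
any(per_group >= 0.5)   # TRUE: group B passes on its own (3 of 5)
```

So row 4 is dropped by the whole-row test but kept once the fraction is computed per group.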
