I have a dataframe with values in rows and samples in columns (two groups, A and B). Example df:
df <- rbind(rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(1, times = 7), rep(2, times = 3)),
            rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(2, times = 7), rep(1, times = 3)))
colnames(df) <- c("A1", "A2", "A3", "A4", "A5",
"B1", "B2", "B3", "B4", "B5")
row.names(df) <- 1:8
I have been selecting a subset of rows where all samples are below a certain threshold using the following:
selected <- apply(df, MARGIN = 1, function(x) all(x < 1.5))
df.sel <- df[selected,]
The result of this is:
df[c(1,5),]
I require two further types of selections. The first is to select, for example, all rows where at least 90% of the samples are below the threshold value of 1.5. The result of this should be:
df[c(1,2,5,6),]
The second is to select by group: say, rows where at least 50% of the values in at least one of the groups are greater than 1.5. This should give me the following df:
df[c(4,8),]
I am new to Stack Overflow, and I have been asked in the past to include a reproducible example. I hope this one is good!
df[!rowSums(df >= 1.5),]
## A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1
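The !rowSums(...) one-liner works because rowSums(df >= 1.5) counts, for each row, how many entries are at or above the threshold, and ! turns a zero count into TRUE. A more explicit equivalent (rebuilding the question's matrix so the snippet is self-contained; "offending" is just an illustrative name):

```r
# Rebuild the example matrix from the question
df <- rbind(rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(1, times = 7), rep(2, times = 3)),
            rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(2, times = 7), rep(1, times = 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Count, per row, the entries at or above the threshold;
# keep only the rows where that count is zero
offending <- rowSums(df >= 1.5)
df[offending == 0, ]   # same rows as df[!rowSums(df >= 1.5), ]
```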
df[rowMeans(df < 1.5) >= 0.9,]
## A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 2
## 5 1 1 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 2
idx <- apply(df, 1, function(x) {
  # Strip the digits from the column names to recover the group labels
  # ("A"/"B"), then check whether at least 50% of the values in any
  # one group exceed 1.5
  any(tapply(x, gsub("[0-9]", "", names(x)), function(y) mean(y > 1.5)) >= 0.5)
})
df[idx,]
## A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
## 4 1 1 1 1 1 1 1 2 2 2
## 8 2 2 2 2 2 2 2 1 1 1
In your specific case of nearly all-ones, you can do all this with rowMeans or colMeans. (There is also plyr::colwise for more complicated stuff.)
Select the subset of rows where all samples are below a certain threshold:
df[rowMeans(df < 1.5) == 1, ]
(Note that df[rowMeans(df) < 1.5, ] would not work here: a row mean below 1.5 does not imply every value is below 1.5; row 4 has mean 1.3 but contains 2s.)
Select all rows where at least 90% of samples are below the threshold value of 1.5 (this would be much easier if we could exploit knowing that the only other value is 2).
You can directly count the proportion of '1' entries with:
> apply(df, 1, function(x) sum(x == 1)) / ncol(df)
1 2 3 4 5 6 7 8
1.0 0.9 0.8 0.7 1.0 0.9 0.8 0.3
Thus to get the row-indices you want:
> apply(df, 1, function(x) sum(x == 1)) / ncol(df) >= 0.9
1 2 3 4 5 6 7 8
TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
and the row-slice you want:
> df[ apply(df, 1, function(x) sum(x == 1)) / ncol(df) >= 0.9 , ]
A1 A2 A3 A4 A5 B1 B2 B3 B4 B5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 2
5 1 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 2
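For what it's worth, the same proportion can be computed without apply() at all: rowMeans on a logical matrix gives the fraction of TRUEs per row. A standalone sketch ("prop_ones" is just an illustrative name):

```r
# Rebuild the example matrix from the question
df <- rbind(rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(1, times = 7), rep(2, times = 3)),
            rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(2, times = 7), rep(1, times = 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Proportion of entries equal to 1 in each row, then the 90% slice
prop_ones <- rowMeans(df == 1)
df.sel <- df[prop_ones >= 0.9, ]   # rows 1, 2, 5, 6
```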
The second is to select by group. Say, rows where at least 50% of values in at least one of the groups are greater than 1.5.
Unless I misunderstand what you meant by 'at least one of the groups', your example is wrong: counting over all ten samples, row 4 does not qualify; only row 8 does.
Again, you could either cheat with rowSums, or else:
> apply(df, 1, function(x) sum(x >= 1.5)) / ncol(df) >= 0.5
1 2 3 4 5 6 7 8
FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
That only gets you row 8, not 4, so have I misunderstood you? (Jake Burhead clarifies that you are doing hierarchical indexing by the string name of the column. See his solution; there is no point in my reproducing it.)
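For completeness, a vectorized sketch of the per-group reading (at least 50% of the values in at least one group above 1.5), assuming the A1..A5 / B1..B5 column-name scheme from the example; this recovers the rows 4 and 8 the asker expected:

```r
# Rebuild the example matrix from the question
df <- rbind(rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(1, times = 7), rep(2, times = 3)),
            rep(1, times = 10),
            c(rep(1, times = 9), 2),
            c(rep(1, times = 8), rep(2, times = 2)),
            c(rep(2, times = 7), rep(1, times = 3)))
colnames(df) <- c(paste0("A", 1:5), paste0("B", 1:5))
rownames(df) <- 1:8

# Group label per column: strip the trailing digits from the names
groups <- gsub("[0-9]", "", colnames(df))

# For each group, is the proportion of values above 1.5 at least 50%?
over <- sapply(unique(groups), function(g)
  rowMeans(df[, groups == g, drop = FALSE] > 1.5) >= 0.5)

# Keep rows where at least one group crosses the 50% mark
df[apply(over, 1, any), ]   # rows 4 and 8
```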