How to subset columns based on threshold values specified for a subset of rows
I want to subset the columns of a large data frame according to this rule:
For each row (except row A), the value should be below 5.
Given the following example data frame, I want the function to return a data frame with only column c1, since all of its values in rows B:E are below 5. I think the select_if function is probably the way to go, but I can't figure out how to exclude specific rows in that function.
Gene <- c("A", "B", "C", "D", "E")
c1 <- c(500, 1, 0, 3, 0)
c2 <- c(240, 235, 270, 100, 1)
c3 <- c(0, 3, 1000, 900, 2)
df1 <- data.frame(Gene, c1, c2, c3)
head(df1)
Gene c1 c2 c3
1 A 500 240 0
2 B 1 235 3
3 C 0 270 1000
4 D 3 100 900
5 E 0 1 2
A tidyverse solution is:
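library(dplyr)
library(tidyr)

df1 %>%
  select(
    df1 %>%
      filter(row_number() > 1) %>%                  # ignore the first row (A)
      summarise(across(starts_with("c"), max)) %>%  # maximum of each c* column
      pivot_longer(everything()) %>%                # long format: name, value
      filter(value < 5) %>%                         # keep columns whose maximum is below 5
      pull(name)
  )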
c1
1 500
2 1
3 0
4 3
5 0
Explanation: the code inside the select calculates the maximum value for each column after ignoring the first row. The result is then pivoted into long format, creating the default columns name and value. This data frame is filtered to keep only the rows where value is less than five, that is, the columns whose maximum (and therefore every remaining value) is below five. The name column is then pulled and used as an argument to the outer select.
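To make the steps above concrete, here is a sketch that runs the inner pipeline one stage at a time on the example data. The intermediate names col_max, long and keep are introduced here for illustration only; they are not part of the original answer.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  Gene = c("A", "B", "C", "D", "E"),
  c1 = c(500, 1, 0, 3, 0),
  c2 = c(240, 235, 270, 100, 1),
  c3 = c(0, 3, 1000, 900, 2)
)

# Step 1: drop the first row (A) and take the maximum of each c* column
col_max <- df1 %>%
  filter(row_number() > 1) %>%
  summarise(across(starts_with("c"), max))
# col_max holds c1 = 3, c2 = 270, c3 = 1000

# Step 2: reshape to one row per column, with columns `name` and `value`
long <- col_max %>% pivot_longer(everything())

# Step 3: keep the names whose maximum is below the threshold
keep <- long %>% filter(value < 5) %>% pull(name)
# keep is "c1"

# Step 4: use the names in the outer select
result <- df1 %>% select(all_of(keep))
```

Because the maximum of rows B:E is compared against 5, a column passes only if every one of those values is below 5.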
If you need other columns, just modify the select, for example:
df1 %>%
  select(
    c("Gene",
      df1 %>%
        filter(row_number() > 1) %>%
        summarise(across(starts_with("c"), max)) %>%
        pivot_longer(everything()) %>%
        filter(value < 5) %>%
        pull(name)
    )
  )
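Since the question mentions select_if, one possible alternative is where() inside select(), which is the selection helper that replaced the scoped select_if style in dplyr 1.0.0. The following is a sketch rather than part of the original answer; it assumes row A is the first row and that every column other than Gene is numeric.

```r
library(dplyr)

df1 <- data.frame(
  Gene = c("A", "B", "C", "D", "E"),
  c1 = c(500, 1, 0, 3, 0),
  c2 = c(240, 235, 270, 100, 1),
  c3 = c(0, 3, 1000, 900, 2)
)

# Keep Gene, plus every numeric column whose values after
# the first row (x[-1]) are all below 5
res_where <- df1 %>%
  select(Gene, where(function(x) is.numeric(x) && all(x[-1] < 5)))
```

The predicate receives each column as a vector, so dropping the first element with x[-1] is how specific rows are excluded from the test while all rows are kept in the result.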
A base R solution is simple to code.
i <- sapply(df1[-1], \(x) all(x[-1] < 5))
df1[c(TRUE, i)]
#> Gene c1
#> 1 A 500
#> 2 B 1
#> 3 C 0
#> 4 D 3
#> 5 E 0
Created on 2022-06-03 by the reprex package (v2.0.1)
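For a larger data frame, the same base R idea can be wrapped in a small helper that excludes rows by gene name instead of by position. This is only a sketch: the function name keep_cols_below and its arguments threshold, exclude and id_col are hypothetical and not from the original answer.

```r
df1 <- data.frame(
  Gene = c("A", "B", "C", "D", "E"),
  c1 = c(500, 1, 0, 3, 0),
  c2 = c(240, 235, 270, 100, 1),
  c3 = c(0, 3, 1000, 900, 2)
)

# Hypothetical helper: keep the id column plus every other column whose
# values are below `threshold` in all rows not listed in `exclude`
keep_cols_below <- function(df, threshold = 5, exclude = "A", id_col = "Gene") {
  rows <- !(df[[id_col]] %in% exclude)      # rows that must satisfy the rule
  value_cols <- setdiff(names(df), id_col)  # columns to test
  ok <- vapply(
    df[value_cols],
    # isTRUE() makes a column fail if the check yields NA
    function(x) isTRUE(all(x[rows] < threshold)),
    logical(1)
  )
  df[c(id_col, value_cols[ok])]
}

res_default <- keep_cols_below(df1)
res_two     <- keep_cols_below(df1, exclude = c("A", "C", "D"))
```

With the defaults only Gene and c1 are kept. Excluding rows A, C and D as well leaves rows B and E to be checked, so c3 (values 3 and 2) also passes.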
To avoid reshaping or looping, use colSums.
df1[c(1, which(colSums(df1[-1, -1] < 5) == 4) + 1)]
# Gene c1
# 1 A 500
# 2 B 1
# 3 C 0
# 4 D 3
# 5 E 0
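The 4 in the one-liner above is the number of rows being checked (B to E). A sketch that derives that count from the data instead of hard-coding it might look like this; the names below, n_checked and keep are illustrative only.

```r
df1 <- data.frame(
  Gene = c("A", "B", "C", "D", "E"),
  c1 = c(500, 1, 0, 3, 0),
  c2 = c(240, 235, 270, 100, 1),
  c3 = c(0, 3, 1000, 900, 2)
)

# Logical matrix: TRUE where a value in rows B:E is below 5
below <- df1[-1, -1] < 5

# Number of rows that must pass (all rows except the first)
n_checked <- nrow(df1) - 1

# A column qualifies when every checked row is TRUE
keep <- colSums(below) == n_checked
# colSums(below) is c1 = 4, c2 = 1, c3 = 2, so only c1 qualifies

# Always keep the Gene column (first position)
res_colsums <- df1[c(TRUE, keep)]
```

Comparing against nrow(df1) - 1 keeps the expression correct if rows are added or removed, as long as row A stays first.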