
How to subset columns based on threshold values specified for a subset of rows

I want to subset the columns of a big dataframe, keeping only those that adhere to this rule:

For each row (except row A), the value should be below 5.

Given the following example dataframe, I want the function to return a dataframe with only column c1, since all of its values in rows B:E are below 5. I think the select_if function is probably the way to go, but I can't figure out how to exclude specific rows in that function.

Gene <- c("A", "B", "C", "D", "E")
c1 <- c(500, 1, 0, 3, 0)
c2 <- c(240, 235, 270, 100, 1)
c3 <- c(0, 3, 1000, 900, 2)
df1 <- data.frame(Gene, c1, c2, c3)
head(df1)

  Gene  c1  c2   c3
1    A 500 240    0
2    B   1 235    3
3    C   0 270 1000
4    D   3 100  900
5    E   0   1    2

A tidyverse solution is:

library(dplyr)
library(tidyr)

df1 %>% 
  select(
    df1 %>% 
      filter(row_number() > 1) %>% 
      summarise(across(starts_with("c"), max)) %>% 
      pivot_longer(everything()) %>% 
      filter(value < 5) %>% 
      pull(name)
  )
#>    c1
#> 1 500
#> 2   1
#> 3   0
#> 4   3
#> 5   0

Explanation: the code inside the select calculates the maximum value of each column after dropping the first row. The result is then pivoted into long format, which creates the default columns name and value. This data frame is filtered to keep only the rows whose value (the column maximum) is less than five, which is equivalent to every value in that column being below five. The name column is then pulled and used as the argument to the outer select.
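To make the explanation easier to follow, here is a sketch that runs the same pipeline one step at a time. The intermediate names (col_max, col_max_long, keep) are introduced here for illustration only and are not part of the original answer; all_of() is used in the last step as the usual way to pass a character vector of names to select.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  Gene = c("A", "B", "C", "D", "E"),
  c1 = c(500, 1, 0, 3, 0),
  c2 = c(240, 235, 270, 100, 1),
  c3 = c(0, 3, 1000, 900, 2)
)

# Step 1: drop the first row, then take the maximum of each c* column.
col_max <- df1 %>%
  filter(row_number() > 1) %>%
  summarise(across(starts_with("c"), max))
# col_max is a one-row data frame: c1 = 3, c2 = 270, c3 = 1000

# Step 2: reshape to one row per column, with default columns `name` and `value`.
col_max_long <- pivot_longer(col_max, everything())

# Step 3: keep the names of the columns whose maximum is below the threshold.
keep <- col_max_long %>%
  filter(value < 5) %>%
  pull(name)
# keep is "c1"

# Step 4: use those names in the outer select.
result <- df1 %>% select(all_of(keep))
```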

If you need other columns, just modify the select, for example:

df1 %>% 
  select(
    c(
      "Gene", 
      df1 %>% 
        filter(row_number() > 1) %>% 
        summarise(across(starts_with("c"), max)) %>% 
        pivot_longer(everything()) %>% 
        filter(value < 5) %>% 
        pull(name)
    )
  )
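Since the question asked about select_if, it may be worth noting that the row exclusion can also be done inside the column predicate itself. The following is a minimal sketch, not taken from the answers above, that assumes dplyr 1.0.0 or later (for where()) and R 4.1 or later (for the \(x) lambda syntax). The predicate drops the first element of each column with x[-1] before testing it, and the is.numeric() check keeps the character column Gene from being compared against 5.

```r
library(dplyr)

df1 <- data.frame(
  Gene = c("A", "B", "C", "D", "E"),
  c1 = c(500, 1, 0, 3, 0),
  c2 = c(240, 235, 270, 100, 1),
  c3 = c(0, 3, 1000, 900, 2)
)

# Keep Gene explicitly, plus every numeric column whose values
# (ignoring the first row) are all below 5.
result <- df1 %>%
  select(Gene, where(\(x) is.numeric(x) && all(x[-1] < 5)))
```

This avoids the summarise/pivot round trip, at the cost of identifying the excluded row by position rather than by its Gene label.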

A base R solution is simple to code.

i <- sapply(df1[-1], \(x) all(x[-1] < 5))
df1[c(TRUE, i)]
#>   Gene  c1
#> 1    A 500
#> 2    B   1
#> 3    C   0
#> 4    D   3
#> 5    E   0

Created on 2022-06-03 by the reprex package (v2.0.1)

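To avoid reshaping or looping, use colSums: for each column, count how many values in rows B:E are below 5 and keep the columns where that count equals 4, the number of rows checked.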

df1[c(1, which(colSums(df1[-1, -1] < 5) == 4) + 1)]
#   Gene  c1
# 1    A 500
# 2    B   1
# 3    C   0
# 4    D   3
# 5    E   0
