Create a new column as the lowest value within other columns in R
I have a data frame such as:
tab
  Groups Species evalue bits NAME
1     G1     SP1   1.00  120    A
2     G1     SP1   0.50  130    B
3     G1     SP2   1.20  100    C
4     G1     SP3   0.02  190    X
5     G1     SP3   0.00  390    Z
6     G1     SP3   0.00  400    Y
7     G2     SP1   2.20   67    B
8     G2     SP1   2.10   69    A
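Within each combination of Groups and Species, I would like to add a new column called consensus_NAME, which is the NAME of the row with the lowest evalue and the highest bits.
Here I should get: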
tab
  Groups Species evalue bits NAME consensus_NAME
1     G1     SP1   1.00  120    A              B
2     G1     SP1   0.50  130    B              B
3     G1     SP2   1.20  100    C              C
4     G1     SP3   0.02  190    X              Y
5     G1     SP3   0.00  390    Z              Y
6     G1     SP3   0.00  400    Y              Y
7     G2     SP1   2.20   67    B              A
8     G2     SP1   2.10   69    A              A
So far I have tried:
tab %>% filter(NAME != "") %>%
  group_by(Groups,Species) %>%
  top_n(-1,1, evalue,bits) %>%
  distinct(consensus_NAME = NAME) %>%
  right_join(tab)
Here is the data frame:
dput(tab)
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L), .Label = c("G1", "G2"), class = "factor"), Species = structure(c(1L,
1L, 2L, 3L, 3L, 3L, 1L, 1L), .Label = c("SP1", "SP2", "SP3"), class = "factor"),
evalue = c(1, 0.5, 1.2, 0.02, 0, 0, 2.2, 2.1), bits = c(120L,
130L, 100L, 190L, 390L, 400L, 67L, 69L), NAME = structure(c(1L,
2L, 3L, 4L, 6L, 5L, 2L, 1L), .Label = c("A", "B", "C", "X",
"Y", "Z"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
I think the cleanest way is to combine group_by with mutate and evaluate the required condition within each group:
suppressPackageStartupMessages(library(dplyr))
t <- structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L), .Label = c("G1", "G2"), class = "factor"), Species = structure(c(1L,
1L, 2L, 3L, 3L, 3L, 1L, 1L), .Label = c("SP1", "SP2", "SP3"), class = "factor"),
evalue = c(1, 0.5, 1.2, 0.02, 0, 0, 2.2, 2.1), bits = c(120L,
130L, 100L, 190L, 390L, 400L, 67L, 69L), NAME = structure(c(1L,
2L, 3L, 4L, 6L, 5L, 2L, 1L), .Label = c("A", "B", "C", "X",
"Y", "Z"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
t %>% mutate_if(is.factor, as.character) %>%
  group_by(Groups, Species) %>%
  mutate(
    consensus_name = NAME[bits == max(bits) & evalue == min(evalue)]
  )
#> # A tibble: 8 x 6
#> # Groups:   Groups, Species [4]
#>   Groups Species evalue  bits NAME  consensus_name
#>   <chr>  <chr>    <dbl> <int> <chr> <chr>
#> 1 G1     SP1       1      120 A     B
#> 2 G1     SP1       0.5    130 B     B
#> 3 G1     SP2       1.2    100 C     C
#> 4 G1     SP3       0.02   190 X     Y
#> 5 G1     SP3       0      390 Z     Y
#> 6 G1     SP3       0      400 Y     Y
#> 7 G2     SP1       2.2     67 B     A
#> 8 G2     SP1       2.1     69 A     A
Created on 2021-07-19 by the reprex package (v2.0.0)
This code is only robust if, within every group, the row with the minimum evalue is also the row with the maximum bits.
For example, after:
t[2,4] <- 110
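the code fails with an error, because in the G1/SP1 group no single row has both the lowest evalue and the highest bits, so the subset is empty.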
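One possible way around the empty-subset failure is to rank the rows within each group and take the NAME of the best-ranked row. This is only a sketch: it assumes that evalue takes priority and that bits merely breaks ties, which the question does not state explicitly. The data frame df below is a small hypothetical example, not the original data.

```r
library(dplyr)

# Hypothetical data: in group G1/SP1 the lowest evalue (row 2) and the
# highest bits (row 1) sit on different rows, which is the case that
# breaks the `bits == max(bits) & evalue == min(evalue)` subset.
df <- data.frame(
  Groups  = c("G1", "G1", "G1", "G1"),
  Species = c("SP1", "SP1", "SP3", "SP3"),
  evalue  = c(1, 0.5, 0, 0),
  bits    = c(120L, 110L, 390L, 400L),
  NAME    = c("A", "B", "Z", "Y"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(Groups, Species) %>%
  # order() ranks by evalue ascending, then by bits descending;
  # [1] keeps the index of the best-ranked row in the group.
  mutate(consensus_name = NAME[order(evalue, -bits)[1]]) %>%
  ungroup()

print(res)
```

Because order() always returns one index per row, the subset can never be empty, so every non-empty group gets exactly one consensus value.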
This produces the desired result, but (as noted in my comment) it prioritises evalue over bits.
tab <- structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L), .Label = c("G1", "G2"), class = "factor"), Species = structure(c(1L,
1L, 2L, 3L, 3L, 3L, 1L, 1L), .Label = c("SP1", "SP2", "SP3"), class = "factor"),
evalue = c(1, 0.5, 1.2, 0.02, 0, 0, 2.2, 2.1), bits = c(120L,
130L, 100L, 190L, 390L, 400L, 67L, 69L), NAME = structure(c(1L,
2L, 3L, 4L, 6L, 5L, 2L, 1L), .Label = c("A", "B", "C", "X",
"Y", "Z"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
library(tidyverse)
tab %>%
  filter(NAME != "") %>%
  group_by(Groups,Species) %>%
  arrange(evalue, desc(bits)) %>%
  slice(1) %>%
  select(Groups, Species, consensus_NAME = NAME) %>%
  right_join(tab, by = c("Groups", "Species")) %>%
  relocate(consensus_NAME, .after = NAME)
#> # A tibble: 8 x 6
#> # Groups:   Groups, Species [4]
#>   Groups Species evalue  bits NAME  consensus_NAME
#>   <fct>  <fct>    <dbl> <int> <fct> <fct>
#> 1 G1     SP1       1      120 A     B
#> 2 G1     SP1       0.5    130 B     B
#> 3 G1     SP2       1.2    100 C     C
#> 4 G1     SP3       0.02   190 X     Y
#> 5 G1     SP3       0      390 Z     Y
#> 6 G1     SP3       0      400 Y     Y
#> 7 G2     SP1       2.2     67 B     A
#> 8 G2     SP1       2.1     69 A     A
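To make the priority between the two criteria visible, the sketch below applies the same arrange/slice idea to a hypothetical two-row group in which evalue and bits disagree, once with evalue first and once with bits first. The data frame df and the names by_evalue and by_bits are illustrative assumptions, not part of the original answer.

```r
library(dplyr)

# Hypothetical group where the two criteria disagree:
# row "A" has the highest bits, row "B" has the lowest evalue.
df <- data.frame(
  Groups  = c("G1", "G1"),
  Species = c("SP1", "SP1"),
  evalue  = c(1, 0.5),
  bits    = c(120L, 110L),
  NAME    = c("A", "B"),
  stringsAsFactors = FALSE
)

# evalue first, bits as tie-breaker (the order used in the answer above)
by_evalue <- df %>%
  group_by(Groups, Species) %>%
  arrange(evalue, desc(bits), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

# bits first, evalue as tie-breaker
by_bits <- df %>%
  group_by(Groups, Species) %>%
  arrange(desc(bits), evalue, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

by_evalue$NAME  # "B"
by_bits$NAME    # "A"
```

Swapping the order of the arguments to arrange() is enough to change which criterion wins, so the choice should follow whichever measure matters more for the data at hand.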
Using dplyr:
library(dplyr)
left_join(tab,
          tab %>%
            group_by(Groups, Species) %>%
            mutate(diff = bits - evalue) %>%
            filter(diff == max(diff)) %>%
            # keep Species explicitly so dplyr does not have to re-add the grouping variable
            select(Groups, Species, NAME) %>%
            rename(consensus_NAME = NAME),
          by = c("Groups", "Species"))
Output:
  Groups Species evalue bits NAME consensus_NAME
1     G1     SP1   1.00  120    A              B
2     G1     SP1   0.50  130    B              B
3     G1     SP2   1.20  100    C              C
4     G1     SP3   0.02  190    X              Y
5     G1     SP3   0.00  390    Z              Y
6     G1     SP3   0.00  400    Y              Y
7     G2     SP1   2.20   67    B              A
8     G2     SP1   2.10   69    A              A
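The bits - evalue score gives the expected result on the sample data, but it combines two quantities measured on different scales, so it may not always agree with a "lowest evalue first, highest bits second" rule. The sketch below uses a hypothetical two-row group (the names P and Q and the values are made up for illustration) to show one case where the two approaches pick different rows.

```r
library(dplyr)

# Hypothetical group: "P" has the lowest evalue, while "Q" has slightly
# higher bits but a much worse evalue.
df <- data.frame(
  Groups  = c("G1", "G1"),
  Species = c("SP1", "SP1"),
  evalue  = c(0, 5),
  bits    = c(100L, 110L),
  NAME    = c("P", "Q"),
  stringsAsFactors = FALSE
)

# Score-based pick, as in the bits - evalue approach
by_score <- df %>%
  group_by(Groups, Species) %>%
  mutate(diff = bits - evalue) %>%
  filter(diff == max(diff)) %>%
  ungroup()

# Lexicographic pick: lowest evalue first, highest bits as tie-breaker
by_rank <- df %>%
  group_by(Groups, Species) %>%
  arrange(evalue, desc(bits), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

by_score$NAME  # "Q" (diff 105 beats diff 100)
by_rank$NAME   # "P" (evalue 0 beats evalue 5)
```

Note also that filter(diff == max(diff)) keeps every row tied for the maximum, so a tie within a group would duplicate rows after the join, whereas slice(1) always keeps exactly one row per group.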