Get frequency based on two columns
A snippet of my large data frame looks like this:
MARKERS.IN.HAPLOTYPES BASE rs. alleles chrom pos GID marker trial
1A.12 C S1A_494392059 C/G 1A 494392059 GID7173723 2 ES26-38
1A.13 C S1A_497201550 C/T 1A 497201550 GID7173723 0 ES26-38
1A.14 T S1A_499864157 C/T 1A 499864157 GID7173723 2 ES26-38
1B.10 A S1B_566171302 G/A 1B 566171302 GID7173723 0 ES26-38
1B.20 G S1B_642616640 A/G 1B 642616640 GID7173723 2 ES26-38
2B.10 A S2B_24883552 A/G 2B 24883552 GID7173723 2 ES26-38
Here is its dput:
structure(list(MARKERS.IN.HAPLOTYPES = c("1A.12", "1A.13", "1A.14",
"1B.10", "1B.20", "2B.10"), BASE = c("C", "C", "T", "A", "G",
"A"), rs. = c("S1A_494392059", "S1A_497201550", "S1A_499864157",
"S1B_566171302", "S1B_642616640", "S2B_24883552"), alleles = c("C/G",
"C/T", "C/T", "G/A", "A/G", "A/G"), chrom = c("1A", "1A", "1A",
"1B", "1B", "2B"), pos = c(494392059L, 497201550L, 499864157L,
566171302L, 642616640L, 24883552L), GID = c("GID7173723", "GID7173723",
"GID7173723", "GID7173723", "GID7173723", "GID7173723"), marker = c("2",
"0", "2", "0", "2", "2"), trial = c("ES26-38", "ES26-38", "ES26-38",
"ES26-38", "ES26-38", "ES26-38")), row.names = c(NA, 6L), class =
"data.frame")
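In the original data frame, the rs. column has 22 unique values and the trial column has six unique values. For each unique value of rs. and each unique value of trial, I want to calculate the relative frequency of the different values in the marker column. For example, the first entry of the rs. column, S1A_494392059, would get the frequencies of its marker values within trial ES26-38, and so on for every other combination. Note that the marker column is a character vector, not numeric.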
You can try the following:
library(dplyr)
df %>%
add_count(rs., trial, name = "Total") %>%
add_count(rs., trial, marker, name = "MarkerTotal") %>%
mutate(RelativeFreq = round(MarkerTotal / Total, 2))
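The name argument of add_count is a new feature in dplyr 0.8 that lets you choose the name of the count column (by default it is n, or nn if n is already taken). If you don't have a recent version of the package, the code above will not work.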
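If you are not sure which version you have, a quick check along these lines may help. It is only a sketch: the variable name has_name_arg is made up for illustration, and the 0.8.0 cutoff is taken from the note above.

```r
# Compare the installed dplyr version against 0.8.0, the release that
# (per the note above) added the name argument to add_count().
has_name_arg <- packageVersion("dplyr") >= "0.8.0"

if (!has_name_arg) {
  message("dplyr is older than 0.8.0; update it with install.packages(\"dplyr\")")
}
```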
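In your example the relative frequency is 1 everywhere, because the sample isn't particularly complex: each combination of rs. and trial appears only once.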
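To see frequencies other than 1, here is a small sketch on made-up data. The toy data frame and the second trial name ES39-51 are hypothetical and only serve to show the same add_count approach on a case where one rs. value has several rows per trial.

```r
library(dplyr)

# Hypothetical toy data: one rs. value scored several times in two trials.
# "ES39-51" is an invented trial name, not taken from the question.
toy <- data.frame(
  rs.    = rep("S1A_494392059", 6),
  trial  = c("ES26-38", "ES26-38", "ES26-38", "ES26-38", "ES39-51", "ES39-51"),
  marker = c("2", "2", "0", "2", "0", "0"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  # rows per rs./trial pair
  add_count(rs., trial, name = "Total") %>%
  # rows per rs./trial/marker combination
  add_count(rs., trial, marker, name = "MarkerTotal") %>%
  mutate(RelativeFreq = round(MarkerTotal / Total, 2))

print(res)
```

In trial ES26-38 the marker value "2" occurs in 3 of 4 rows and "0" in 1 of 4, so the rows get 0.75 and 0.25 respectively; in the invented trial ES39-51 both rows are "0", so the frequency is 1.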
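If you want a summarised data frame instead (one row per combination of rs., trial, and marker, with its RelativeFreq), you can do the following:
df %>%
  # count rows per rs./trial/marker combination (collapses to one row each)
  count(rs., trial, marker, name = "MarkerTotal") %>%
  group_by(rs., trial) %>%
  # divide by the total number of rows for that rs./trial pair
  mutate(RelativeFreq = round(MarkerTotal / sum(MarkerTotal), 2)) %>%
  ungroup()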