Use highest value in column per level to calculate relative values in data.table
I'm preparing data for a heatmap and I want to plot the changes relative to the highest value. I want to compare the patterns rather than the absolute abundances per id, and also limit the scale of the heatmap to 0 to 100 %.
This is my data:
head(kallisto_melt,14)
id protein_name variable value relative_abundance
1: BIJBGGEO_00001 hypothetical protein tpm_A1 0.0000000 NA
2: BIJBGGEO_00001 hypothetical protein tpm_A2 0.0000000 NA
3: BIJBGGEO_00001 hypothetical protein tpm_A3 0.0000000 NA
4: BIJBGGEO_00001 hypothetical protein tpm_A4 0.0000000 NA
5: BIJBGGEO_00001 hypothetical protein tpm_A5 0.0000000 NA
6: BIJBGGEO_00001 hypothetical protein tpm_A6 0.0000000 NA
7: BIJBGGEO_00001 hypothetical protein tpm_A7 0.0000000 NA
8: BIJBGGEO_00002 hypothetical protein tpm_A1 0.0000000 NA
9: BIJBGGEO_00002 hypothetical protein tpm_A2 0.0000000 NA
10: BIJBGGEO_00002 hypothetical protein tpm_A3 0.0000000 NA
11: BIJBGGEO_00002 hypothetical protein tpm_A4 0.0703664 NA
12: BIJBGGEO_00002 hypothetical protein tpm_A5 0.0000000 NA
13: BIJBGGEO_00002 hypothetical protein tpm_A6 0.0000000 NA
14: BIJBGGEO_00002 hypothetical protein tpm_A7 0.0863996 NA
I tried to add a column of relative values that sets the highest value per id to 100 % and scales the other values accordingly. I could imagine that all zeroes result in NA (the first 7 rows), but for the second id I expected something like this:
id protein_name variable value relative_abundance
1: BIJBGGEO_00001 hypothetical protein tpm_A1 0.0000000 NA
2: BIJBGGEO_00001 hypothetical protein tpm_A2 0.0000000 NA
3: BIJBGGEO_00001 hypothetical protein tpm_A3 0.0000000 NA
4: BIJBGGEO_00001 hypothetical protein tpm_A4 0.0000000 NA
5: BIJBGGEO_00001 hypothetical protein tpm_A5 0.0000000 NA
6: BIJBGGEO_00001 hypothetical protein tpm_A6 0.0000000 NA
7: BIJBGGEO_00001 hypothetical protein tpm_A7 0.0000000 NA
8: BIJBGGEO_00002 hypothetical protein tpm_A1 0.0000000 0
9: BIJBGGEO_00002 hypothetical protein tpm_A2 0.0000000 0
10: BIJBGGEO_00002 hypothetical protein tpm_A3 0.0000000 0
11: BIJBGGEO_00002 hypothetical protein tpm_A4 0.0703664 "somewhere about 81"
12: BIJBGGEO_00002 hypothetical protein tpm_A5 0.0000000 0
13: BIJBGGEO_00002 hypothetical protein tpm_A6 0.0000000 0
14: BIJBGGEO_00002 hypothetical protein tpm_A7 0.0863996 100
I adapted code from an earlier question of mine, "R how to calculate relative values based on a long format data.frame column?", and it looks like this:
kallisto_melt[,relative_abundance := value/(value[max(value)]*100), by = .(id)]
What am I doing wrong?
Use this code to get the relative values:
library(dplyr)
df1 <- df %>%
group_by(id,protein_name) %>%
mutate(relative_abundance = value/max(value)*100)
df1[is.na(df1)] <- 0
Data:
df<- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("BIJBGGEO_00001", "BIJBGGEO_00002"
), class = "factor"), protein_name = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "hypothetical protein", class = "factor"),
variable = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L,
3L, 4L, 5L, 6L, 7L), .Label = c("tpm_A1", "tpm_A2", "tpm_A3",
"tpm_A4", "tpm_A5", "tpm_A6", "tpm_A7"), class = "factor"),
value = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0703664, 0, 0,
0.0863996), relative_abundance = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-14L))
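As a side note on why the expression in the question yields nothing usable, here is a minimal base R sketch using the seven values of the second id. `value[max(value)]` uses the maximum as an index rather than as a value; R truncates a non-integer index toward zero, and index 0 selects no elements. The parentheses in `value/(... * 100)` would additionally divide by 100 instead of multiplying. The variable names `picked` and `relative` are only illustrative.

```r
# Values for id BIJBGGEO_00002, copied from the question
value <- c(0, 0, 0, 0.0703664, 0, 0, 0.0863996)

# max(value) is 0.0863996; as an index it is truncated to 0,
# and indexing with 0 returns a zero-length vector
picked <- value[max(value)]
length(picked)

# Dividing by the maximum itself, then multiplying by 100,
# gives the intended percentages
relative <- value / max(value) * 100
round(relative, 5)
```

The fourth element comes out at about 81.44 and the last at 100, which matches the output the question asks for.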
With data.table, we can do:
# setDT(kallisto_melt)
kallisto_melt[, relative_abundance := value / max(value) * 100, by = id]
kallisto_melt[is.na(relative_abundance), relative_abundance := 0]
kallisto_melt
# id protein_name variable value relative_abundance
# 1: BIJBGGEO_00001 hypothetical protein tpm_A1 0.0000000 0.00000
# 2: BIJBGGEO_00001 hypothetical protein tpm_A2 0.0000000 0.00000
# 3: BIJBGGEO_00001 hypothetical protein tpm_A3 0.0000000 0.00000
# 4: BIJBGGEO_00001 hypothetical protein tpm_A4 0.0000000 0.00000
# 5: BIJBGGEO_00001 hypothetical protein tpm_A5 0.0000000 0.00000
# 6: BIJBGGEO_00001 hypothetical protein tpm_A6 0.0000000 0.00000
# 7: BIJBGGEO_00001 hypothetical protein tpm_A7 0.0000000 0.00000
# 8: BIJBGGEO_00002 hypothetical protein tpm_A1 0.0000000 0.00000
# 9: BIJBGGEO_00002 hypothetical protein tpm_A2 0.0000000 0.00000
#10: BIJBGGEO_00002 hypothetical protein tpm_A3 0.0000000 0.00000
#11: BIJBGGEO_00002 hypothetical protein tpm_A4 0.0703664 81.44297
#12: BIJBGGEO_00002 hypothetical protein tpm_A5 0.0000000 0.00000
#13: BIJBGGEO_00002 hypothetical protein tpm_A6 0.0000000 0.00000
#14: BIJBGGEO_00002 hypothetical protein tpm_A7 0.0863996 100.00000
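The question expected NA rather than 0 for an id whose values are all zero. The following is a hedged variant for that case, not the answer's own method: it rebuilds a small table from the question's values (the name `dt` is illustrative) and converts the NaN that 0/0 produces into NA instead of 0.

```r
library(data.table)

# Small reproduction of the question's data (two ids, seven samples each)
dt <- data.table(
  id = rep(c("BIJBGGEO_00001", "BIJBGGEO_00002"), each = 7),
  variable = rep(paste0("tpm_A", 1:7), times = 2),
  value = c(rep(0, 7), 0, 0, 0, 0.0703664, 0, 0, 0.0863996)
)

# Scale each value to the maximum of its id; 0 / 0 is NaN in R,
# so ids with only zeros end up as NaN at this point
dt[, relative_abundance := value / max(value) * 100, by = id]

# Mark those rows as missing instead of setting them to 0
dt[is.nan(relative_abundance), relative_abundance := NA_real_]
dt
```

Whether all-zero ids should show up as 0 or as missing in the heatmap is a design choice; keeping NA lets the plotting step give them a separate colour.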