Calculating frequency of values by column in R
Does anyone know how to replace the value of a cell with the frequency with which that value occurs in a column? I'm trying to turn a dataframe full of breed labels and factors for genes into a frequency chart (with an eye later to seeing whether animals that have common alleles for one gene tend to have common alleles for other genes, too). As an example, my initial dataframe looks like this:
Breed Gene A Gene B Gene C
Collie 3 5 8
Collie 5 7 2
Lab 3 3 1
Pug 3 7 8
Pug 3 7 9
Pug 4 4 9
And I'd like the result to look like this:
Breed Gene A Gene B Gene C
2 4 1 2
2 1 3 1
1 4 1 1
3 4 3 2
3 4 3 2
3 1 1 2
I can see how to do this using a for loop (create a new dataframe, loop over each column, loop over each row, change each value to a counter that goes up by one when it encounters an equal value), but is there a simpler and more efficient apply or dplyr approach? The data set is large and I'm going to have to re-run this often, and I'm concerned nested for loops will be too slow.
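For comparison, the nested-loop approach described above could be sketched roughly as follows. This is only an illustration of the baseline, not code from the question: the helper name `count_by_loop` and the small `example` data frame are assumptions made for the sketch.

```r
# Hypothetical helper: replace every cell with the number of times its
# value occurs in the same column, using explicit loops
count_by_loop <- function(df) {
  out <- df
  for (j in seq_along(df)) {
    col <- df[[j]]
    counts <- integer(length(col))
    for (i in seq_along(col)) {
      # Compare the current value against the whole column
      counts[i] <- sum(col == col[i])
    }
    out[[j]] <- counts
  }
  out
}

# Two columns taken from the example table in the question
example <- data.frame(
  Breed = c("Collie", "Collie", "Lab", "Pug", "Pug", "Pug"),
  GeneA = c(3L, 5L, 3L, 3L, 3L, 4L)
)
res <- count_by_loop(example)
res
```

Each column costs a number of comparisons that grows with the square of the row count, which is why this is likely to be slow on a large data set.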
Here's a base R option:
replace_value_by_count <- function(x) ave(x, x, FUN = length)
df[] <- lapply(df, replace_value_by_count)
df
# Breed GeneA GeneB GeneC
#1 2 4 1 2
#2 2 1 3 1
#3 1 4 1 1
#4 3 4 3 2
#5 3 4 3 2
#6 3 1 1 2
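To see why this works, it may help to look at single columns: `ave(x, x, FUN = length)` groups `x` by its own values and fills each position with the size of its group. A minimal sketch using the Breed and GeneA values from the example (the variable names are just for illustration, and the `as.integer()` wrapper is an addition, on the assumption that `ave()` keeps the type of a character input):

```r
# GeneA column from the example data
gene_a <- c(3L, 5L, 3L, 3L, 3L, 4L)

# Group the vector by its own values and replace each element
# with the number of elements in its group
gene_a_counts <- ave(gene_a, gene_a, FUN = length)
gene_a_counts
# [1] 4 1 4 4 4 1

# For a character column, convert the result explicitly so the
# counts are numeric rather than strings
breed <- c("Collie", "Collie", "Lab", "Pug", "Pug", "Pug")
breed_counts <- as.integer(ave(breed, breed, FUN = length))
breed_counts
# [1] 2 2 1 3 3 3
```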
Since you have tagged dplyr, the same function can also be applied using dplyr.
library(dplyr)
df <- df %>% mutate(across(everything(), replace_value_by_count))
data
df <- structure(list(Breed = c("Collie", "Collie", "Lab", "Pug", "Pug",
"Pug"), GeneA = c(3L, 5L, 3L, 3L, 3L, 4L), GeneB = c(5L, 7L,
3L, 7L, 7L, 4L), GeneC = c(8L, 2L, 1L, 8L, 9L, 9L)),
class = "data.frame", row.names = c(NA, -6L))
We may use base R:
df[] <- lapply(df, function(x) table(x)[as.character(x)])
-output
> df
Breed GeneA GeneB GeneC
1 2 4 1 2
2 2 1 3 1
3 1 4 1 1
4 3 4 3 2
5 3 4 3 2
6 3 1 1 2
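The lookup can be seen more easily on a single column: `table(x)` counts each distinct value, and indexing that table by `as.character(x)` pulls out the count for every element by name. A minimal sketch with the GeneB values (variable names are illustrative; `as.vector()` is an addition here, used to drop the table names and class so that the result is a plain integer vector):

```r
# GeneB column from the example data
gene_b <- c(5L, 7L, 3L, 7L, 7L, 4L)

# Count occurrences of each distinct value; the names are the
# values converted to strings ("3", "4", "5", "7")
tab <- table(gene_b)

# Look up each element's count by name, then strip the table attributes
counts <- as.vector(tab[as.character(gene_b)])
counts
# [1] 1 3 1 3 3 1
```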
Or using tidyverse:
library(dplyr)
df %>%
  mutate(across(everything(), ~ tibble(col1 = .x) %>%
    add_count(col1) %>%
    pull(n)))
Breed GeneA GeneB GeneC
1 2 4 1 2
2 2 1 3 1
3 1 4 1 1
4 3 4 3 2
5 3 4 3 2
6 3 1 1 2
df <- structure(list(Breed = c("Collie", "Collie", "Lab", "Pug", "Pug",
"Pug"), GeneA = c(3L, 5L, 3L, 3L, 3L, 4L), GeneB = c(5L, 7L,
3L, 7L, 7L, 4L), GeneC = c(8L, 2L, 1L, 8L, 9L, 9L)),
class = "data.frame", row.names = c(NA,
-6L))