Calculate the frequency of values by column in R

Question

Does anyone know how to replace the value of a cell with the frequency with which that value occurs in its column? I'm trying to turn a data frame full of breed labels and factors for genes into a frequency chart (with an eye later to seeing whether animals that have common alleles for one gene tend to have common alleles for other genes, too). As an example, my initial data frame looks like this:

Breed    Gene A     Gene B    Gene C
Collie      3          5         8
Collie      5          7         2
Lab         3          3         1
Pug         3          7         8
Pug         3          7         9
Pug         4          4         9

And I'd like the result to look like this:

Breed    Gene A     Gene B    Gene C
2           4          1         2
2           1          3         1
1           4          1         1
3           4          3         2
3           4          3         2
3           1          1         2

I can see how to do this using a for loop (create a new data frame, loop over each column, loop over each row, change each value to a counter that goes up by one when it encounters an equal value), but is there a simpler and more efficient apply or dplyr approach? The data set is large and I'm going to have to re-run this often, and I'm concerned nested for loops will be too slow.

Answer 1

Here's a base R option:

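# Replace each value with the number of times it occurs in the vector
replace_value_by_count <- function(x) ave(x, x, FUN = length)
# df[] overwrites every column while keeping the data frame structure
df[] <- lapply(df, replace_value_by_count)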
df

#  Breed GeneA GeneB GeneC
#1     2     4     1     2
#2     2     1     3     1
#3     1     4     1     1
#4     3     4     3     2
#5     3     4     3     2
#6     3     1     1     2
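
To make it easier to see what `ave(x, x, FUN = length)` does, here is a minimal sketch on single columns taken from the question's example data. The variable names `gene_a`, `breed`, `gene_a_counts`, and `breed_counts` are only for illustration.

```r
# Two columns from the question's example data
gene_a <- c(3L, 5L, 3L, 3L, 3L, 4L)
breed  <- c("Collie", "Collie", "Lab", "Pug", "Pug", "Pug")

# ave() groups the vector by its own values, applies length() to each group,
# and writes the group size back to every position belonging to that group
gene_a_counts <- ave(gene_a, gene_a, FUN = length)
gene_a_counts
# [1] 4 1 4 4 4 1

# The result keeps the type of the input, so a character column gives strings
breed_counts <- ave(breed, breed, FUN = length)
breed_counts
# [1] "2" "2" "1" "3" "3" "3"

# Convert explicitly if numeric counts are needed downstream
as.integer(breed_counts)
# [1] 2 2 1 3 3 3
```

If the counts will be used in later numeric work, wrapping the call in `as.integer()` for character columns avoids carrying strings around.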

Since you have tagged dplyr, the same function can also be used with dplyr.

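library(dplyr)
# Apply the same function to every column; recent dplyr versions expect .cols to be given explicitly
df <- df %>% mutate(across(everything(), replace_value_by_count))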

Data

df <- structure(list(Breed = c("Collie", "Collie", "Lab", "Pug", "Pug", 
"Pug"), GeneA = c(3L, 5L, 3L, 3L, 3L, 4L), GeneB = c(5L, 7L, 
3L, 7L, 7L, 4L), GeneC = c(8L, 2L, 1L, 8L, 9L, 9L)), 
class = "data.frame", row.names = c(NA, -6L))

Answer 2

We can use base R:

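# Look up each value's count in the column's frequency table;
# as.integer() drops the table names so the columns are plain integers
df[] <- lapply(df, function(x) as.integer(table(x)[as.character(x)]))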

-output

> df
  Breed GeneA GeneB GeneC
1     2     4     1     2
2     2     1     3     1
3     1     4     1     1
4     3     4     3     2
5     3     4     3     2
6     3     1     1     2

Or using tidyverse:

library(dplyr)
df %>%
    mutate(across(everything(), ~ tibble(col1 = .x) %>% 
             add_count(col1) %>% 
             pull(n)))

-output

  Breed GeneA GeneB GeneC
1     2     4     1     2
2     2     1     3     1
3     1     4     1     1
4     3     4     3     2
5     3     4     3     2
6     3     1     1     2

Data

df <- structure(list(Breed = c("Collie", "Collie", "Lab", "Pug", "Pug", 
"Pug"), GeneA = c(3L, 5L, 3L, 3L, 3L, 4L), GeneB = c(5L, 7L, 
3L, 7L, 7L, 4L), GeneC = c(8L, 2L, 1L, 8L, 9L, 9L)),
   class = "data.frame", row.names = c(NA, 
-6L))
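
Since the question mentions a large data set that will be processed often, another base R variant that may be worth benchmarking against the options above is a `match()` plus `tabulate()` lookup. This is only a sketch: the helper name `count_by_match` and the result name `res` are hypothetical, and whether it is actually faster on your data is something to measure rather than assume.

```r
# Example data from the question
df <- data.frame(
  Breed = c("Collie", "Collie", "Lab", "Pug", "Pug", "Pug"),
  GeneA = c(3L, 5L, 3L, 3L, 3L, 4L),
  GeneB = c(5L, 7L, 3L, 7L, 7L, 4L),
  GeneC = c(8L, 2L, 1L, 8L, 9L, 9L)
)

# Hypothetical helper: replace each value with its frequency in the vector
count_by_match <- function(x) {
  idx <- match(x, unique(x))  # index of each value among the distinct values
  tabulate(idx)[idx]          # count per distinct value, expanded back to every row
}

# Work on a copy so the original labels stay available
res <- df
res[] <- lapply(df, count_by_match)
res
```

Because `tabulate()` always returns integers, every column of `res` is an integer vector, including the one derived from the character column Breed.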

Answer 1
2 2021-10-18 13:34:14

Answer 2
0 2021-10-18 16:20:38