Frequency table when there are multiple columns representing one value (R)

I have a dataset like this:

ID    color1   color2  color3   shape1       shape2        size
55    red     blue     NA       circle       triangle      small
67    yellow  NA       NA       triangle     NA            medium
83    blue    yellow   NA       circle       NA            large
78    red     yellow   blue     square       circle        large
43    green   NA       NA       square       circle        small
29    yellow  green    NA       circle       triangle      medium

I would like to create a data frame with the frequency and percentage of each level of each variable, but I am having trouble because some variables are spread across multiple columns (for example, color1, color2 and color3 all hold colors). The desired output is:


Variable      Level        Freq        Percent 
 
color         blue          3           27.27
              red           2           18.18
              yellow        4           36.36
              green         2           18.18
              total         11          100.00

shape         circle        5           50.0       
              triangle      3           30.0
              square        2           20.0
              total         10          100.0

size          small         2           33.3
              medium        2           33.3
              large         2           33.3
              total         6           100.0

I believe I need to convert these variables to long format and then use summarise/mutate to get the frequencies, but I can't figure out how. Any help is greatly appreciated.

You can use the tidyverse packages to pivot the data into long format and then summarise the desired statistics.

library(tidyverse)

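# df is the data frame from the question
df |> 
  # Transform all columns into a long format;
  # the pattern keeps only the letters of each name (color1 -> color).
  # [A-Za-z] is used because [A-z] also matches [ \ ] ^ _ and `
  pivot_longer(cols = -ID,
               names_pattern = "([A-Za-z]+)",
               names_to = c("variable")) |>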
  # Drop NA entries
  drop_na(value) |>
  # Group by variable
  group_by(variable) |>
  # Count
  count(value) |>
  # Calculate percentage as n / sum of n by variable
  mutate(perc = 100* n / sum(n))

# A tibble: 10 x 4
# Groups:   variable [3]
#   variable value        n  perc
#   <chr>    <chr>    <int> <dbl>
# 1 color    blue         3  27.3
# 2 color    green        2  18.2
# 3 color    red          2  18.2
# 4 color    yellow       4  36.4
# 5 shape    circle       5  50  
# 6 shape    square       2  20  
# 7 shape    triangle     3  30  
# 8 size     large        2  33.3
# 9 size     medium       2  33.3
#10 size     small        2  33.3
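
The result above does not include the per-variable total rows from the requested layout. Below is a minimal sketch of one way to append them using only dplyr and tidyr. It assumes the question's data, re-created here as `dat` so the block is self-contained; the names `freq`, `totals` and `result` are illustrative, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Data from the question (assumed name: dat)
dat <- data.frame(
  ID     = c(55L, 67L, 83L, 78L, 43L, 29L),
  color1 = c("red", "yellow", "blue", "red", "green", "yellow"),
  color2 = c("blue", NA, "yellow", "yellow", NA, "green"),
  color3 = c(NA, NA, NA, "blue", NA, NA),
  shape1 = c("circle", "triangle", "circle", "square", "square", "circle"),
  shape2 = c("triangle", NA, NA, "circle", "circle", "triangle"),
  size   = c("small", "medium", "large", "large", "small", "medium")
)

# Frequencies and percentages per variable, as in the answer above
freq <- dat |>
  pivot_longer(cols = -ID,
               names_pattern = "([A-Za-z]+)",
               names_to = "variable",
               values_to = "value") |>
  drop_na(value) |>
  count(variable, value) |>
  group_by(variable) |>
  mutate(perc = 100 * n / sum(n)) |>
  ungroup()

# One "total" row per variable
totals <- freq |>
  group_by(variable) |>
  summarise(value = "total", n = sum(n), perc = sum(perc))

# Stack both and place each total row after the levels of its variable
result <- bind_rows(freq, totals) |>
  arrange(variable, value == "total", value)

print(result)
```

Sorting on `value == "total"` places each total row last within its variable because `FALSE` sorts before `TRUE`.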

Combining and adding to:

  1. Merge multiple frequency tables together in R

  2. Adding a column of total n for each group in a stacked frequency table


library(dplyr)
library(tidyr)
library(stringr)  # provides str_extract(), used below
library(janitor)  # provides adorn_totals()

options(digits = 3)

df %>% 
  pivot_longer(
    -ID,
    names_to = "Variable",
    values_to = "Level"
  ) %>% 
  mutate(Variable = str_extract(Variable, '[A-Za-z]*')) %>% 
  group_by(Variable, Level) %>% 
  count(Level, name = "Freq") %>% 
  na.omit() %>% 
  group_by(Variable) %>% 
  mutate(Percent = Freq/sum(Freq)*100) %>% 
  group_split() %>% 
  adorn_totals() %>% 
  bind_rows() %>% 
  mutate(Level = ifelse(Level == last(Level), last(Variable), Level)) %>% 
  mutate(Variable = ifelse(duplicated(Variable) |
                             Variable == "Total", NA, Variable))

 Variable    Level Freq Percent
    color     blue    3    27.3
     <NA>    green    2    18.2
     <NA>      red    2    18.2
     <NA>   yellow    4    36.4
     <NA>    Total   11   100.0
    shape   circle    5    50.0
     <NA>   square    2    20.0
     <NA> triangle    3    30.0
     <NA>    Total   10   100.0
     size    large    2    33.3
     <NA>   medium    2    33.3
     <NA>    small    2    33.3
     <NA>    Total    6   100.0

Try this base R approach, which returns a list with one matrix per variable:

uniq <- unique( sub( "[0-9]","", colnames(dat[,-1]) ) )
uniq
[1] "color" "shape" "size"

sapply( uniq, function(x){ tbl <- table( unlist( dat[,grep( x, colnames(dat) )] ) ); 
  rbind( cbind( Percent=tbl/sum(tbl)*100, Freq=tbl ), 
         cbind( sum(tbl/sum(tbl)*100), sum(tbl) ) ) } )
$color
         Percent Freq
blue    27.27273    3
green   18.18182    2
red     18.18182    2
yellow  36.36364    4
       100.00000   11

$shape
         Percent Freq
circle        50    5
square        20    2
triangle      30    3
             100   10

$size
         Percent Freq
large   33.33333    2
medium  33.33333    2
small   33.33333    2
       100.00000    6
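
If a single data frame in the layout requested in the question is preferred over a list of matrices, the same idea can be sketched in base R as follows. This is only an illustration under stated assumptions: the data is re-created here as `dat` so the block runs on its own, and the names `vars` and `freq_table` are made up for this example.

```r
# Data from the question (assumed name: dat)
dat <- data.frame(
  ID     = c(55L, 67L, 83L, 78L, 43L, 29L),
  color1 = c("red", "yellow", "blue", "red", "green", "yellow"),
  color2 = c("blue", NA, "yellow", "yellow", NA, "green"),
  color3 = c(NA, NA, NA, "blue", NA, NA),
  shape1 = c("circle", "triangle", "circle", "square", "square", "circle"),
  shape2 = c("triangle", NA, NA, "circle", "circle", "triangle"),
  size   = c("small", "medium", "large", "large", "small", "medium")
)

# Column names without trailing digits: "color", "shape", "size"
vars <- unique(sub("[0-9]+$", "", names(dat)[-1]))

# Build one small data frame per variable, then stack them
freq_table <- do.call(rbind, lapply(vars, function(v) {
  # All columns belonging to this variable, e.g. color1, color2, color3
  cols <- grep(paste0("^", v, "[0-9]*$"), names(dat))
  # table() drops NA values by default
  tbl <- table(unlist(dat[cols], use.names = FALSE))
  data.frame(
    Variable = v,
    Level    = c(names(tbl), "total"),
    Freq     = c(as.vector(tbl), sum(tbl)),
    Percent  = round(c(as.vector(tbl) / sum(tbl) * 100, 100), 2),
    stringsAsFactors = FALSE
  )
}))
rownames(freq_table) <- NULL

print(freq_table)
```

Indexing with `dat[cols]` rather than `dat[, cols]` keeps a data frame even when a variable has a single column (as with size), so `unlist()` behaves the same in every case.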

Data

dat <- structure(list(ID = c(55L, 67L, 83L, 78L, 43L, 29L), color1 = c("red", 
"yellow", "blue", "red", "green", "yellow"), color2 = c("blue", 
NA, "yellow", "yellow", NA, "green"), color3 = c(NA, NA, NA, 
"blue", NA, NA), shape1 = c("circle", "triangle", "circle", "square", 
"square", "circle"), shape2 = c("triangle", NA, NA, "circle", 
"circle", "triangle"), size = c("small", "medium", "large", "large", 
"small", "medium")), class = "data.frame", row.names = c(NA, 
-6L))
