How can I pull frequency data from multiple columns to make a bar chart?

Question

Essentially, I am trying to make a bar chart using dplyr and ggplot from several columns, say A, B, and C.

Each column holds a 0 or 1 indicating whether the row belongs to that type.

I am trying to make a bar chart using ggplot that shows the number of rows that contain a true value in each column. Any advice, at least on the syntax I'd follow?

Example:

A 1 1 1 0 0 0 

B 0 0 0 1 0 0

C 0 0 0 0 1 1

I want to show the frequency of each, but as if those three were columns

Edit: I should note that I am trying to pull these from a larger data set, e.g. A, B, C, D, E, F, G, H, ..., but I only want A, B, and C.

Answer 1

Try this

library(dplyr)
library(ggplot2)
library(tibble)

df <- as.data.frame(
  rbind(
    A = c(1, 1, 1, 0, 0, 0),
    B = c(0, 0, 0, 1, 0, 0),
    C = c(0, 0, 0, 0, 1, 1),
    D = c(0, 0, 0, 0, 0, 0),
    E = c(0, 0, 0, 0, 0, 0)
))

df %>%
  # NOTE: name of id variable should not start with "v" or "V"
  # Otherwise the select will not work.
  rownames_to_column(var = "type") %>% 
  mutate(count = rowSums(select(., starts_with("V")), na.rm = TRUE)) %>% 
  select(type, count) %>% 
  filter(type %in% c("A", "B", "C")) %>% 
  ggplot(aes(type, count, fill = type)) +
  geom_col() +
  # "none" hides the redundant legend; fill = FALSE is deprecated in newer ggplot2
  guides(fill = "none")

Created on 2020-03-15 by the reprex package (v0.3.0)
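The pipeline above depends on the value columns keeping their default names V1, V2, and so on. Here is a minimal sketch of one way to drop that dependency by summing every numeric column instead. It assumes dplyr 1.0.0 or later (for across() and where()); the helper name count_rows is made up for illustration.

```r
library(dplyr)
library(tibble)

df <- as.data.frame(
  rbind(
    A = c(1, 1, 1, 0, 0, 0),
    B = c(0, 0, 0, 1, 0, 0),
    C = c(0, 0, 0, 0, 1, 1),
    D = c(0, 0, 0, 0, 0, 0),
    E = c(0, 0, 0, 0, 0, 0)
  ))

# Hypothetical helper: per-row sum over all numeric columns, whatever their names
count_rows <- function(df, keep) {
  df %>%
    rownames_to_column(var = "type") %>%
    filter(type %in% keep) %>%
    # "type" is character, so where(is.numeric) picks only the value columns
    mutate(count = rowSums(across(where(is.numeric)), na.rm = TRUE)) %>%
    select(type, count)
}

counts <- count_rows(df, c("A", "B", "C"))
```

The result can be piped into the same ggplot() call as above.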

Update

First of all, the solutions by @Chris and @Jonathan are both cleaner and clearer than my approach, and both are more efficient. The base R solution by @Chris is by far the most efficient, and not only in terms of programmer efficiency ;). In the benchmark below it runs roughly 10 times faster than the tidyverse solutions. Whether this is crucial depends on the size of the dataset or ...

Here are the results:

I simply wrapped the different solutions in functions (with some renaming only) and ran a microbenchmark. I also added a fourth function which adapts the code by @Chris to allow for flexible names.

library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

# example data
df <- as.data.frame(
  rbind(
    A = c(1, 1, 1, 0, 0, 0),
    B = c(0, 0, 0, 1, 0, 0),
    C = c(0, 0, 0, 0, 1, 1),
    D = c(0, 0, 0, 0, 0, 0),
    E = c(0, 0, 0, 0, 0, 0)
  ))

# Tidyverse 1 using select & rowSums
sum_rows1 <- function(df) {
  df %>%
    # NOTE: name of id variable should not start with "v" or "V"
    # Otherwise the select will not work.
    rownames_to_column(var = "type") %>%
    filter(type %in% c("A", "B", "C")) %>% 
    mutate(count = rowSums(select(., starts_with("V")), na.rm = TRUE)) %>% 
    select(type, count)
}
# Tidyverse 2 using pivot_longer
sum_rows2 <- function(df) {
  df %>%
    #Transpose the data
    t() %>%
    #Convert it as data.frame
    as.data.frame() %>%
    #Get data from wide to long format 
    pivot_longer(cols = everything(),
                 names_to = "type",
                 values_to = "value") %>%
    #Filter to stay only with letters A, B, C
    filter(type %in% c("A","B","C")) %>%
    #group by var (i.e., letters)
    group_by(type) %>%
    #Get the sum of values per letter
    summarize(count = sum(value))
}

# base R 1 with fixed names
sum_rows3 <- function(df) {
  sum1 <- apply(t(df)[,1:3], 2, sum)
  data.frame(type = LETTERS[1:3], count = sum1)
}

# base R 2 with flexible names
sum_rows4 <- function(df, cols) {
  sum1 <- apply(t(df)[, cols], 2, sum)
  data.frame(type = names(sum1), count = sum1)
}

(df1 <- sum_rows1(df))
#>   type count
#> 1    A     3
#> 2    B     1
#> 3    C     2
(df2 <- sum_rows2(df))
#> # A tibble: 3 x 2
#>   type  count
#>   <chr> <dbl>
#> 1 A         3
#> 2 B         1
#> 3 C         2
(df3 <- sum_rows3(df))
#>   type count
#> A    A     3
#> B    B     1
#> C    C     2
(df4 <- sum_rows4(df, c("A","B","C")))
#>   type count
#> A    A     3
#> B    B     1
#> C    C     2

# Benchmark the solutions
microbenchmark::microbenchmark(sum_rows1(df), sum_rows2(df), sum_rows3(df), sum_rows4(df, c("A","B","C")))
#> Unit: microseconds
#>                             expr    min      lq     mean  median      uq
#>                    sum_rows1(df) 4239.5 4619.60 6079.313 6072.20 6771.15
#>                    sum_rows2(df) 3658.1 4085.55 5309.038 5225.95 5939.90
#>                    sum_rows3(df)  301.6  383.15  540.001  437.55  539.10
#>  sum_rows4(df, c("A", "B", "C"))  302.6  387.05  533.977  469.05  546.40
#>      max neval
#>  11238.7   100
#>  13808.2   100
#>   5018.6   100
#>   4106.9   100

Created on 2020-03-16 by the reprex package (v0.3.0)
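Since the types are already rows of df, a further base R variant can skip both t() and apply() by indexing the rows by name and calling rowSums(). This is only a sketch: sum_rows5 is a hypothetical name, and it was not part of the benchmark above, so no timing is claimed for it.

```r
# example data (same as above)
df <- as.data.frame(
  rbind(
    A = c(1, 1, 1, 0, 0, 0),
    B = c(0, 0, 0, 1, 0, 0),
    C = c(0, 0, 0, 0, 1, 1),
    D = c(0, 0, 0, 0, 0, 0),
    E = c(0, 0, 0, 0, 0, 0)
  ))

# Hypothetical fifth variant: pick the rows by name, then sum across each row
sum_rows5 <- function(df, rows) {
  # drop = FALSE keeps a data frame even if only one row is requested
  sums <- rowSums(df[rows, , drop = FALSE], na.rm = TRUE)
  data.frame(type = names(sums), count = unname(sums))
}

df5 <- sum_rows5(df, c("A", "B", "C"))
```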

Answer 2

Here is another tidyverse solution. It uses two great functions, pivot_longer and summarize, to reshape the data and build the desired plot.

library(tidyverse)

df %>%
  #Transpose the data
  t() %>%
  #Convert it as data.frame
  as.data.frame() %>%
  #Get data from wide to long format 
  pivot_longer(cols = everything(),
               names_to = "var",
               values_to = "value") %>%
  #Filter to stay only with letters A, B, C
  filter(var %in% c("A","B","C")) %>%
  #group by var (i.e., letters)
  group_by(var) %>%
  #Get the sum of values per letter
  summarize(sum = sum(value)) %>%
  #ggplot with geom_col (i.e., columns plot)
  ggplot(aes(x = var,
             y = sum,
             fill = var)) +
  geom_col()
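The question's edit describes A, B, and C as columns of a larger data set. If the data really is laid out that way (one row per observation, one 0/1 column per type), the transpose step is not needed and select() can pick the columns directly. A sketch under that assumption; the data frame wide is invented for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up example: one row per observation, one 0/1 column per type
wide <- data.frame(
  A = c(1, 1, 1, 0, 0, 0),
  B = c(0, 0, 0, 1, 0, 0),
  C = c(0, 0, 0, 0, 1, 1),
  D = c(0, 0, 0, 0, 0, 0)
)

counts <- wide %>%
  # Keep only the columns of interest
  select(A, B, C) %>%
  # Wide to long: one row per (column name, value) pair
  pivot_longer(cols = everything(),
               names_to = "var",
               values_to = "value") %>%
  group_by(var) %>%
  # Number of 1s per column
  summarize(sum = sum(value))

p <- ggplot(counts, aes(x = var, y = sum, fill = var)) +
  geom_col()
```

Printing p (or calling print(p) in a script) draws the chart.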

Answer 3

A simple base R solution is this, using @stefan's data:

First, calculate the sum for each row of df. Transposing with t turns the rows of df into the columns of t(df). Then apply with margin 2 applies sum to each of those columns, and the [,1:3] index keeps only the first three of them (A, B, and C):

sum1 <- apply(t(df)[,1:3], 2, sum)

Then create a dataframe with the relevant sequence of upper-case letters as the first variable and sum1 as the second variable:

sum2 <- data.frame(types = LETTERS[1:3], sum1)

And finally plot your barplot using sum2 as input data:

ggplot(sum2, aes(types, sum1, fill = types))  +  
    geom_col(fill = c("#009E00", "#F0E300", "#0066B2"))
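To make the transpose-and-apply step more concrete, the sketch below inspects t(df) and compares the apply() call with colSums(), which should give the same sums here. Selecting the columns by name instead of by position is my own variation, not part of the answer.

```r
# @stefan's example data
df <- as.data.frame(
  rbind(
    A = c(1, 1, 1, 0, 0, 0),
    B = c(0, 0, 0, 1, 0, 0),
    C = c(0, 0, 0, 0, 1, 1),
    D = c(0, 0, 0, 0, 0, 0),
    E = c(0, 0, 0, 0, 0, 0)
  ))

# After transposing, the types A..E are columns of a 6 x 5 matrix
m <- t(df)
dims <- dim(m)

# As in the answer: margin 2 means "apply sum to each column"
via_apply <- apply(m[, 1:3], 2, sum)

# Variation (assumption, not from the answer): pick columns by name, use colSums
via_colsums <- colSums(m[, c("A", "B", "C")])
```

Selecting by name keeps working if the row order of df changes, whereas 1:3 silently picks whatever happens to come first.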

3 answers

Answer 1 (with Update): 2, ACCEPTED, 2020-03-15 09:36:19
Answer 2: 2, 2020-03-15 16:46:28
Answer 3: 1, 2020-03-15 16:39:36
