Essentially, I am trying to make a bar chart using dplyr where there are several columns, say A, B, and C.
Each column holds a 0 or 1 flag indicating whether the row belongs to that type.
I am trying to make a bar chart using ggplot that shows the number of rows that contain a true value in each column. Any advice, at least on the syntax I'd follow?
Example:
A 1 1 1 0 0 0
B 0 0 0 1 0 0
C 0 0 0 0 1 1
I want to show the frequency of each, but as if those three were columns
Edit: I should note that I am trying to pull these from a larger data set, e.g. A, B, C, D, E, F, G, H, ..., but I only want A, B, and C.
Try this:
library(dplyr)
library(ggplot2)
library(tibble)
df <- as.data.frame(
rbind(
A = c(1, 1, 1, 0, 0, 0),
B = c(0, 0, 0, 1, 0, 0),
C = c(0, 0, 0, 0, 1, 1),
D = c(0, 0, 0, 0, 0, 0),
E = c(0, 0, 0, 0, 0, 0)
))
df %>%
# NOTE: name of id variable should not start with "v" or "V"
# Otherwise the select will not work.
rownames_to_column(var = "type") %>%
mutate(count = rowSums(select(., starts_with("V")), na.rm = TRUE)) %>%
select(type, count) %>%
filter(type %in% c("A", "B", "C")) %>%
ggplot(aes(type, count, fill = type)) +
geom_col() +
# Hide the fill legend ("none" replaces the deprecated FALSE in recent ggplot2)
guides(fill = "none")
Created on 2020-03-15 by the reprex package (v0.3.0)
First of all, the solutions by @Chris and @Jonathan are both much cleaner and clearer than my approach, and both are more efficient. In terms of run time, however, the base R solution by @Chris is by far the fastest (and not only in terms of programmer efficiency ;)). The results show that the base R solution is faster than the tidyverse solutions by a factor of about 10. Whether this matters depends, for example, on the size of the dataset.
I simply wrapped the different solutions in functions (with only some renaming) and ran a microbenchmark. I also added a fourth function, which adapts the code by @Chris to allow for flexible names.
Here are the results:
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)
# example data
df <- as.data.frame(
rbind(
A = c(1, 1, 1, 0, 0, 0),
B = c(0, 0, 0, 1, 0, 0),
C = c(0, 0, 0, 0, 1, 1),
D = c(0, 0, 0, 0, 0, 0),
E = c(0, 0, 0, 0, 0, 0)
))
# Tidyverse 1 using select & rowSums
sum_rows1 <- function(df) {
df %>%
# NOTE: name of id variable should not start with "v" or "V"
# Otherwise the select will not work.
rownames_to_column(var = "type") %>%
filter(type %in% c("A", "B", "C")) %>%
mutate(count = rowSums(select(., starts_with("V")), na.rm = TRUE)) %>%
select(type, count)
}
# Tidyverse 2 using pivot_longer
sum_rows2 <- function(df) {
df %>%
#Transpose the data
t() %>%
#Convert it as data.frame
as.data.frame() %>%
#Get data from wide to long format
pivot_longer(cols = everything(),
names_to = "type",
values_to = "value") %>%
#Filter to stay only with letters A, B, C
filter(type %in% c("A","B","C")) %>%
#group by var (i.e., letters)
group_by(type) %>%
#Get the sum of values per letter
summarize(count = sum(value))
}
# base R 1 with fixed names
sum_rows3 <- function(df) {
sum1 <- apply(t(df)[,1:3], 2, sum)
data.frame(type = LETTERS[1:3], count = sum1)
}
# base R 2 with flexible names
sum_rows4 <- function(df, cols) {
sum1 <- apply(t(df)[, cols], 2, sum)
data.frame(type = names(sum1), count = sum1)
}
(df1 <- sum_rows1(df))
#> type count
#> 1 A 3
#> 2 B 1
#> 3 C 2
(df2 <- sum_rows2(df))
#> # A tibble: 3 x 2
#> type count
#> <chr> <dbl>
#> 1 A 3
#> 2 B 1
#> 3 C 2
(df3 <- sum_rows3(df))
#> type count
#> A A 3
#> B B 1
#> C C 2
(df4 <- sum_rows4(df, c("A","B","C")))
#> type count
#> A A 3
#> B B 1
#> C C 2
# Benchmark the solutions
microbenchmark::microbenchmark(sum_rows1(df), sum_rows2(df), sum_rows3(df), sum_rows4(df, c("A","B","C")))
#> Unit: microseconds
#> expr min lq mean median uq
#> sum_rows1(df) 4239.5 4619.60 6079.313 6072.20 6771.15
#> sum_rows2(df) 3658.1 4085.55 5309.038 5225.95 5939.90
#> sum_rows3(df) 301.6 383.15 540.001 437.55 539.10
#> sum_rows4(df, c("A", "B", "C")) 302.6 387.05 533.977 469.05 546.40
#> max neval
#> 11238.7 100
#> 13808.2 100
#> 5018.6 100
#> 4106.9 100
Created on 2020-03-16 by the reprex package (v0.3.0)
Here is another solution using tidyverse that uses two great functions (pivot_longer and summarize) to organize the data and build the desired plot.
library(tidyverse)
df %>%
#Transpose the data
t() %>%
#Convert it as data.frame
as.data.frame() %>%
#Get data from wide to long format
pivot_longer(cols = everything(),
names_to = "var",
values_to = "value") %>%
#Filter to stay only with letters A, B, C
filter(var %in% c("A","B","C")) %>%
#group by var (i.e., letters)
group_by(var) %>%
#Get the sum of values per letter
summarize(sum = sum(value)) %>%
#ggplot with geom_col (i.e., columns plot)
ggplot(aes(x = var,
y = sum,
fill = var)) +
geom_col()
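If the flags are stored as columns rather than rows, as the question describes, the transpose step is not needed. Below is a minimal sketch under that assumption; the data frame `wide` and its extra column `D` are made up for illustration and are not part of the original data.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical data: one row per observation, one 0/1 column per type
wide <- data.frame(
  A = c(1, 1, 1, 0, 0, 0),
  B = c(0, 0, 0, 1, 0, 0),
  C = c(0, 0, 0, 0, 1, 1),
  D = c(0, 0, 0, 0, 0, 0)
)

counts <- wide %>%
  # Keep only the columns of interest
  select(A, B, C) %>%
  # Go from wide to long: one row per (column name, value) pair
  pivot_longer(cols = everything(),
               names_to = "type",
               values_to = "value") %>%
  # Sum the flags per column name
  group_by(type) %>%
  summarise(count = sum(value))

# Build the bar chart from the per-type counts
p <- ggplot(counts, aes(x = type, y = count, fill = type)) +
  geom_col()
```

Calling `print(p)` draws the chart; `counts` holds one row per selected column with its number of 1s.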
A simple base R solution is this, using @stefan's data:
First, calculate the sums for each row in df by transposing it (flipping rows into columns and vice versa) with t, keeping the first three columns (A, B, and C) with [,1:3], and then calling apply with margin 2 (because the rows in df have become columns in t(df)) and sum as the function:
sum1 <- apply(t(df)[,1:3], 2, sum)
Then create a data frame with the relevant sequence of upper-case letters as the first variable and sum1 as the second variable:
sum2 <- data.frame(types = LETTERS[1:3], sum1)
And finally draw the bar plot using sum2 as input data:
ggplot(sum2, aes(types, sum1, fill = types)) +
geom_col(fill = c("#009E00", "#F0E300", "#0066B2"))
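For comparison, the transpose-and-apply step above can also be written without transposing, by selecting the rows by name and calling rowSums. This is only a sketch of an alternative, not part of the original answer; it rebuilds the example data so that it runs on its own, and the name `sum1_alt` is introduced here for illustration.

```r
# Example data as used above: types are rows, observations are columns
df <- as.data.frame(
  rbind(
    A = c(1, 1, 1, 0, 0, 0),
    B = c(0, 0, 0, 1, 0, 0),
    C = c(0, 0, 0, 0, 1, 1),
    D = c(0, 0, 0, 0, 0, 0),
    E = c(0, 0, 0, 0, 0, 0)
  ))

# Sum across each selected row; the result is a named numeric vector
sum1_alt <- rowSums(df[c("A", "B", "C"), ])

# The transpose-and-apply version from the answer, for comparison
sum1 <- apply(t(df)[, 1:3], 2, sum)

# Both approaches should give the same named totals
stopifnot(isTRUE(all.equal(sum1_alt, sum1)))
```

Selecting rows by name avoids relying on A, B, and C being the first three rows of the data.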