Box plot with numeric and categorical variables

Question

I want to create a box plot that shows the distribution of several numeric variables, all on the same scale, against one categorical variable, so that I can compare the measures within each level of the factor.

For example, I want to see how much the quantity (in thousands of $) of the shipments ordered by 3 customers differs by type of product. Take this example data:

# Requires the truncnorm package for rtruncnorm()
prueba <- data.frame("client1" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 6.5, sd = 1),
                     "client2" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 6.9, sd = 2),
                     "client3" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 5, sd = 3),
                     "type" = as.factor(sample(LETTERS[1:3], 60, replace = TRUE, prob = c(0.4, 0.35, 0.25))),
                     "cat" = as.factor(sample(LETTERS[20:22], 60, replace = TRUE, prob = c(0.5, 0.1, 0.4))))
prueba[,1:3] <- round(prueba[,1:3], 1)
head(prueba)
#  client1 client2 client3 type cat
#1     6.3     7.2     7.0    B   T
#2     7.2     6.5     3.5    C   T
#3     8.0     6.4     8.0    A   V
#4     8.0     7.4     7.0    A   V
#5     7.5     7.6     2.5    B   V
#6     7.0     9.0     3.7    A   V

With ggplot I can do this:

library(tidyverse)
library(patchwork)

uno <- prueba %>%
  ggplot(aes(x = type, y = client1)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 10))

dos <- prueba %>%
  ggplot(aes(x = type, y = client2)) +
  geom_boxplot()

tres <- prueba %>%
  ggplot(aes(x = type, y = client3)) +
  geom_boxplot()

uno + dos + tres + plot_layout(byrow = FALSE)

I get this: [image: Differences in distributions]

However, I want something like this: [image: Something like this]

But instead of the boxes within each x-axis group being filled by another category, I want them to show the distribution of each client.

Is this possible?
How can I do this in R?
Are there other visualization methods that do the same?

Answer 1

Are you looking for something like this?

prueba2 <- prueba %>% 
  pivot_longer(cols = starts_with("client"), names_to = "client")

ggplot(data = prueba2, aes(x = type, y = value, fill = client)) +
  geom_boxplot()

If so, first gather all the client# columns into a single column "client", with the corresponding values in another column "value", using pivot_longer (from the package tidyr, which is part of the tidyverse). Then let ggplot do the rest: all we have to tell it is that the x-axis is 'type', the y-axis is 'value', and 'client' is the fill color.
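To see what the reshape does to the data, here is a minimal sketch on a three-row toy data frame. The `toy` data and its values are made up for illustration and are not the asker's `prueba`. Only `pivot_longer()` and its `names_to` argument are taken from the answer. Each wide row becomes one row per client column, and the values land in a column that is named `value` by default.

```r
library(tidyr)

# Hypothetical toy data, standing in for the asker's `prueba`
toy <- data.frame(client1 = c(6.3, 7.2, 8.0),
                  client2 = c(7.2, 6.5, 6.4),
                  type    = factor(c("B", "C", "A")))

# Stack the client columns; values_to is left at its default, "value"
toy_long <- pivot_longer(toy,
                         cols = starts_with("client"),
                         names_to = "client")

names(toy_long)  # type, client, value
nrow(toy_long)   # 6: 3 original rows x 2 client columns
```

With the data in this shape, `client` is an ordinary column, so it can be mapped to `fill` like any other categorical variable.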

Answer 2

I am not sure I understand you correctly, but if you want a box for each client instead of each level of cat, then you have to convert the data to long format:

prueba <- data.frame("client1" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 6.5, sd = 1),
                     "client2" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 6.9, sd = 2),
                     "client3" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 5, sd = 3),
                     "type" = as.factor(sample(LETTERS[1:3], 60, replace = TRUE, prob = c(0.4, 0.35, 0.25))),
                     "cat" = as.factor(sample(LETTERS[20:22], 60, replace = TRUE, prob = c(0.5, 0.1, 0.4))))
prueba[,1:3] <- round(prueba[,1:3], 1)

library(ggplot2)
library(reshape2)

prueba_long <- melt(prueba,  id.vars = c('type', 'cat'))

ggplot(prueba_long, aes(x = type, y = value, colour = variable)) +
  geom_boxplot()
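
The plot maps `colour = variable` because `melt()` puts the former column names in a column called `variable` and the numbers in `value`. A minimal sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration) shows the resulting shape. It also shows the `variable.name` argument, in case a more descriptive column name is preferred.

```r
library(reshape2)

# Hypothetical toy data, standing in for the asker's `prueba`
toy <- data.frame(client1 = c(6.3, 7.2),
                  client2 = c(7.2, 6.5),
                  type    = factor(c("B", "C")),
                  cat     = factor(c("T", "V")))

# Default names: the melted columns are "variable" and "value"
toy_long <- melt(toy, id.vars = c("type", "cat"))
names(toy_long)            # type, cat, variable, value
levels(toy_long$variable)  # client1, client2

# Optional: give the key column a more descriptive name
toy_named <- melt(toy, id.vars = c("type", "cat"), variable.name = "client")
names(toy_named)           # type, cat, client, value
```

Note that `colour` only colours the box outlines. Mapping the same column to `fill` instead colours the box interiors, as in Answer 1.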


solution1
2 ACCEPTED 2020-04-09 19:06:58

solution2
1 2020-04-09 19:10:01
