
Box plot with numeric and categorical variables

I want to create a box plot that shows the distribution of several numeric variables, all on the same scale, against one categorical variable, so that I can compare the measures within each level of the factor.

For example, I want to see how the value (in thousands of $) of the shipments ordered by 3 customers differs by type of product. Take this example data:

prueba <- data.frame("client1" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 6.5, sd = 1),
                     "client2" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 6.9, sd = 2),
                     "client3" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 5, sd = 3),
                     "type" = as.factor(sample(LETTERS[1:3], 60, replace = TRUE, prob = c(0.4, 0.35, 0.25))),
                     "cat" = as.factor(sample(LETTERS[20:22], 60, replace = TRUE, prob = c(0.5, 0.1, 0.4))))
prueba[,1:3] <- round(prueba[,1:3], 1)
head(prueba)
#  client1 client2 client3 type cat
#1     6.3     7.2     7.0    B   T
#2     7.2     6.5     3.5    C   T
#3     8.0     6.4     8.0    A   V
#4     8.0     7.4     7.0    A   V
#5     7.5     7.6     2.5    B   V
#6     7.0     9.0     3.7    A   V

With ggplot I can do this:

library(tidyverse)
library(patchwork)

# One box plot per client, each grouped by product type
uno <- prueba %>%
  ggplot(aes(x = type, y = client1)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 10))

dos <- prueba %>%
  ggplot(aes(x = type, y = client2)) +
  geom_boxplot()

tres <- prueba %>%
  ggplot(aes(x = type, y = client3)) +
  geom_boxplot()

# Combine the three plots with patchwork
uno + dos + tres + plot_layout(byrow = FALSE)

I get this: [image: Differences in distributions]

However, I want something like this: [image: Something like this]

But instead of splitting each level on the x axis by another category, I want it split by client, so that each box shows one client's distribution.

  1. Is this possible?

  2. How can I do this in R?

  3. Are there other visualization methods that do the same?

Are you looking for something like this?

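# Reshape to long format: one row per (observation, client) pair
prueba2 <- prueba %>%
  pivot_longer(cols = starts_with("client"), names_to = "client")

# One box per client within each product type
ggplot(data = prueba2, aes(x = type, y = value, fill = client)) +
  geom_boxplot()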


If so, first gather all the client# columns into one column, "client", with the corresponding values in another column, "value", using pivot_longer (from the package tidyr, which is part of tidyverse). Then let ggplot do the rest. All we have to tell it is that the x-axis is 'type', the y-axis is 'value', and 'client' is the fill color.
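
To make the reshaping step concrete, here is a minimal sketch on a made-up three-row data frame. The name `toy` and its values are invented for illustration and are not the question's data. Each wide row becomes one long row per client column:

```r
library(tidyr)

# Hypothetical toy data: two client columns plus the grouping factor
toy <- data.frame(client1 = c(6.3, 7.2, 8.0),
                  client2 = c(7.2, 6.5, 6.4),
                  type    = factor(c("B", "C", "A")))

# Stack the client columns: the column names go to "client",
# the numbers go to "value" (the default for values_to)
toy_long <- pivot_longer(toy, cols = starts_with("client"), names_to = "client")

# 3 rows x 2 client columns gives 6 rows
nrow(toy_long)
names(toy_long)  # "type" "client" "value"
```

Once the data is in this shape, a single `fill = client` mapping is enough for ggplot to draw the boxes side by side within each level of `type`.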

I am not sure I understand you correctly, but if you want one box for each client instead of one for each level of cat, then you have to convert the data to long format:

prueba <- data.frame("client1" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 6.5, sd = 1),
                     "client2" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 6.9, sd = 2),
                     "client3" = truncnorm::rtruncnorm(n = 60, a = 1, b = 9.8, mean = 5, sd = 3),
                     "type" = as.factor(sample(LETTERS[1:3], 60, replace = TRUE, prob = c(0.4, 0.35, 0.25))),
                     "cat" = as.factor(sample(LETTERS[20:22], 60, replace = TRUE, prob = c(0.5, 0.1, 0.4))))
prueba[,1:3] <- round(prueba[,1:3], 1)

library(reshape2)
library(ggplot2)  # provides ggplot() and geom_boxplot()

prueba_long <- melt(prueba,  id.vars = c('type', 'cat'))

ggplot(prueba_long, aes(x = type, y = value, colour = variable)) +
  geom_boxplot()
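
By default, `melt` names the stacked columns `variable` and `value`, which is why the plot above maps `colour = variable`. The sketch below, on an invented data frame (`toy` is hypothetical, not the question's data), shows the resulting shape. Note that `colour` only tints the box outlines; mapping `fill = variable` instead colors the box interiors, as in the pivot_longer answer:

```r
library(reshape2)
library(ggplot2)

# Hypothetical toy data with the same column layout as prueba
toy <- data.frame(client1 = c(6.3, 7.2, 8.0),
                  client2 = c(7.2, 6.5, 6.4),
                  type    = factor(c("B", "C", "A")),
                  cat     = factor(c("T", "T", "V")))

# Keep type and cat as identifiers; stack every other column
toy_long <- melt(toy, id.vars = c("type", "cat"))

names(toy_long)  # "type" "cat" "variable" "value"

# Same plot as above, but with filled boxes instead of colored outlines
p <- ggplot(toy_long, aes(x = type, y = value, fill = variable)) +
  geom_boxplot()
```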

