
ggplot: Why do I have to transform the data into the long format?

When plotting with ggplot, I often have to transform the data into the long format, as in the code below. This raises two questions for me:

  1. Is there a way to use each column (that is, each variable) as a "group", so that every column is plotted as its own line with a different color, without writing a separate geom_line() for every variable? Then it would not be necessary to transform the data to the long format.
  2. Why do you have to transform the data into the long format at all? What is the reasoning behind it, and how is it better than plotting directly from data in the wide format?

The example code:

library(tidyverse)

# Data in wide format: one column per country
df_wide <- data.frame(
  Horizons = seq(1, 10, 1),
  Country1 = c(2.5, 2.3, 2.2, 2.2, 2.1, 2.0, 1.7, 1.8, 1.7, 1.6),
  Country2 = c(3.5, 3.3, 3.2, 3.2, 3.1, 3.0, 3.7, 3.8, 3.7, 3.6),
  Country3 = c(1.5, 1.3, 1.2, 1.2, 1.1, 1.0, 0.7, 0.8, 0.7, 0.6)
)

# Convert to long format: one row per (Horizons, variable) pair
df_long <- df_wide %>%
  gather(key = "variable", value = "value", -Horizons)

# Plot one line per country
plotstov <- ggplot(df_long, aes(x = Horizons, y = value)) +
  geom_line(aes(colour = variable, group = variable)) +
  theme_bw()

# Display the plot
plotstov

Output: [image]

Thanks a lot in advance!

It's hard to say for sure that this is impossible (for example, someone could write a wrapper package for ggplot that would do this automatically for you), but there's no obvious solution like this.

Hadley Wickham, the author of ggplot, has built the entire "tidyverse" ecosystem on the concept of tidy data, which is essentially data in long format. The basic reason for working with long-format data is that the same data can be represented by many different wide formats, whereas the long format is typically unique. For example, suppose you have data representing revenues by year, country, and industrial sector. In a wide format, do the columns represent years, countries, sectors, or some combination of them? In the tidyverse/ggplot world, the data stay in one shape and you simply specify which variable you want to use as the grouping variable. With a wide-format-oriented tool (such as base R's matplot), you would first have to reshape your data so that the columns represent the grouping variable (say, years), and only then plot it.
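To make the revenue example concrete, here is a minimal sketch. The data frame, its column names, and the revenue values are all made up for illustration; the point is only that one long data frame can be regrouped by changing the aesthetic mapping, with no reshaping in between.

```r
library(ggplot2)

# Hypothetical long-format data: one row per (year, country, sector)
revenue <- expand.grid(
  year    = 2018:2020,
  country = c("A", "B"),
  sector  = c("Energy", "Retail"),
  stringsAsFactors = FALSE
)
revenue$revenue <- seq_len(nrow(revenue)) * 10  # made-up values

# One line per country, one panel per sector
p_country <- ggplot(revenue, aes(x = year, y = revenue, colour = country)) +
  geom_line() +
  facet_wrap(~ sector)

# Same data, now one line per sector and one panel per country
p_sector <- ggplot(revenue, aes(x = year, y = revenue, colour = sector)) +
  geom_line() +
  facet_wrap(~ country)
```

Printing p_country or p_sector draws the corresponding plot. A wide layout would need a different reshape for each of these two views.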

Wickham and co-workers built tools like gather() (or pivot_longer() in newer versions of tidyr, which is part of the tidyverse) to make conversion to long format easy, along with a wide variety of other tools for working with long ("tidy") data.
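For reference, the gather() call from the question can be written with pivot_longer() roughly as follows. This is a sketch using the question's data; the output column names "variable" and "value" are simply carried over from the question.

```r
library(tidyr)

# Wide data from the question: one column per country
df_wide <- data.frame(
  Horizons = 1:10,
  Country1 = c(2.5, 2.3, 2.2, 2.2, 2.1, 2.0, 1.7, 1.8, 1.7, 1.6),
  Country2 = c(3.5, 3.3, 3.2, 3.2, 3.1, 3.0, 3.7, 3.8, 3.7, 3.6),
  Country3 = c(1.5, 1.3, 1.2, 1.2, 1.1, 1.0, 0.7, 0.8, 0.7, 0.6)
)

# Stack every column except Horizons into name/value pairs
df_long <- pivot_longer(
  df_wide,
  cols      = -Horizons,
  names_to  = "variable",
  values_to = "value"
)
```

The result has the same rows as the gather() version, although the row order differs; that does not affect the plot.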

You could write wrappers around ggplot that would do the conversion ...
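One possible shape for such a wrapper is sketched below. The function plot_wide() is hypothetical (it is not part of ggplot2 or any package I know of), and it assumes that every column other than the x column should become its own line.

```r
library(ggplot2)
library(tidyr)

# Hypothetical helper: reshape a wide data frame internally, then plot
# one coloured line per column. `x` is the name of the x-axis column.
plot_wide <- function(data, x) {
  long <- pivot_longer(
    data,
    cols      = -all_of(x),
    names_to  = "variable",
    values_to = "value"
  )
  ggplot(long, aes(x = .data[[x]], y = value, colour = variable)) +
    geom_line() +
    labs(x = x)
}

# Small made-up example in wide format
df_demo <- data.frame(
  Horizons = 1:5,
  Country1 = c(2.5, 2.3, 2.2, 2.2, 2.1),
  Country2 = c(3.5, 3.3, 3.2, 3.2, 3.1)
)

p <- plot_wide(df_demo, "Horizons") + theme_bw()
```

The conversion to long format still happens; the wrapper only hides it from the caller.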

As far as I can see, you already have an answer to your second question, so I'll focus on the first one.

Yes, there is a way to plot each column directly from the wide format, by giving each column its own geom_line() layer:

# load environment
library(ggplot2)
# create dataframe
df <- data.frame(
  Horizons = seq(1,10,1),
  Country1 = c(2.5, 2.3, 2.2, 2.2, 2.1, 2.0, 1.7, 1.8, 1.7, 1.6),
  Country2 = c(3.5, 3.3, 3.2, 3.2, 3.1, 3.0, 3.7, 3.8, 3.7, 3.6),
  Country3 = c(1.5, 1.3, 1.2, 1.2, 1.1, 1.0, 0.7, 0.8, 0.7, 0.6)
)
# plot
ggplot(df) +
  geom_line(aes(x = Horizons, y = Country1, colour = 'Country1')) +
  geom_line(aes(x = Horizons, y = Country2, colour = 'Country2')) +
  geom_line(aes(x = Horizons, y = Country3, colour = 'Country3')) +
  theme_bw()

Output:

[image]

As you can see, you have to specify each column by hand, which becomes a problem when the dataset has many columns. In my experience, it is also much harder to control the colour of each line, since scale_colour_manual() is awkward to use when the data are not in long format with a column that describes the label/colour of each row.
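For comparison, here is a sketch of how colours can be assigned once the data are in long format. The colour choices are arbitrary, and the reshaping step uses pivot_longer() from tidyr rather than the gather() call in the question.

```r
library(ggplot2)
library(tidyr)

df_wide <- data.frame(
  Horizons = 1:10,
  Country1 = c(2.5, 2.3, 2.2, 2.2, 2.1, 2.0, 1.7, 1.8, 1.7, 1.6),
  Country2 = c(3.5, 3.3, 3.2, 3.2, 3.1, 3.0, 3.7, 3.8, 3.7, 3.6),
  Country3 = c(1.5, 1.3, 1.2, 1.2, 1.1, 1.0, 0.7, 0.8, 0.7, 0.6)
)

df_long <- pivot_longer(df_wide, cols = -Horizons,
                        names_to = "variable", values_to = "value")

# Named vector: each name matches a value in the `variable` column
country_colours <- c(
  Country1 = "steelblue",
  Country2 = "firebrick",
  Country3 = "darkgreen"
)

p <- ggplot(df_long, aes(x = Horizons, y = value, colour = variable)) +
  geom_line() +
  scale_colour_manual(values = country_colours) +
  theme_bw()
```

Because the names of country_colours match the values in the variable column, the mapping from country to colour is explicit and does not depend on the order of the layers.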

The wide format can be useful sometimes, but I advise you to use the long format for plotting. You will get much more out of the tidyverse packages that way.
