简体   繁体   English

ggplot:创建比例线 plot

[英]ggplot: Create line plot of proportions

I have a data.table of 59101 observations and I want to make a line plot, not a barplot , of proportions of my exposure column by the unique values in the month.year column (ie Jan 2014,Feb 2014,...,March 2020 ).我有一个 data.table 的 59101 个观察值,我想根据month.year列中的唯一值(即2014 年 1 月,2014 年 2 月,...,2020 年 3 月)绘制一条线 plot,而不是条形图,我的exposure列的比例).

library(data.table)
set.seed(1992)
DT <- data.table::data.table(ID=seq(1,59101),
                             exposure=as.factor(sample(letters[1:10],59101,replace = TRUE)),
                             index_date=sample(seq(as.Date('2014/01/01'), as.Date('2020/03/31'), by="day"),59101,replace = TRUE))
DT[,month.year:=as.factor(zoo::as.yearmon(index_date)),]
DT

> DT
          ID exposure index_date month.year
    1:     1        g 2017-11-26   Nov 2017
    2:     2        i 2019-10-11   Oct 2019
    3:     3        i 2015-02-27   Feb 2015
    4:     4        f 2016-10-18   Oct 2016
    5:     5        e 2019-06-06   Jun 2019
   ---                                     
59097: 59097        e 2015-07-22   Jul 2015
59098: 59098        j 2017-09-04   Sep 2017
59099: 59099        a 2018-04-26   Apr 2018
59100: 59100        a 2019-12-02   Dec 2019
59101: 59101        g 2014-11-04   Nov 2014

I want my data to look like the image I attached to my question (seen below)我希望我的数据看起来像我附加到我的问题的图像(见下文) 按月.年排序的曝光比例

I've been able to produce plots with ggplot in the past but I found that I struggled most with prepping the data, so my code ends up being rather rudimentary and ad hoc;过去我已经能够使用 ggplot 生成绘图,但我发现我在准备数据方面最费劲,所以我的代码最终变得相当简陋和临时; I would like to keep my code as simple and clean as possible.我想让我的代码尽可能简单和干净。

I've seen that melting data is the best approach when it comes to plotting with ggplot but I haven't built the brain muscle memory to instinctly know how to 1) prep the data in that manner and 2) pass the data through ggplot syntax to create what I need.我已经看到,在使用 ggplot 进行绘图时,融化数据是最好的方法,但我还没有建立大脑肌肉 memory 以本能地知道如何 1) 以这种方式准备数据和 2) 通过 ggplot 语法传递数据创造我需要的东西。

If anyone knows or has advice of how to do this, it would be greatly appreciated.如果有人知道或建议如何执行此操作,将不胜感激。

Here's a solution using tidyverse.这是使用 tidyverse 的解决方案。 It's a lot of data to plot, sometimes small multiples would be better. plot 的数据很多,有时小的倍数会更好。

library(data.table)
set.seed(1992)
DT <- data.table::data.table(ID=seq(1,59101),
                             exposure=as.factor(sample(letters[1:10],59101,replace = TRUE)),
                             index_date=sample(seq(as.Date('2014/01/01'), as.Date('2020/03/31'), by="day"),59101,replace = TRUE))
DT[,month.year:=as.factor(zoo::as.yearmon(index_date)),]


library(tidyverse)

DT %>%  as_tibble() %>% 
  group_by(month.year, exposure) %>% 
  count() %>% 
  ungroup() %>% 
  group_by(month.year) %>% 
  mutate(ttl = sum(n),
         pct_ttl = n / ttl,
         date = lubridate::myd(month.year, truncated = 1L)) %>% 
  print(n = 20) %>% 
  ggplot(aes(date, pct_ttl, color = exposure, group = exposure)) +
  geom_line() +
  scale_x_date(scale_x_date(date_breaks = "4 months", date_labels = "%b %Y"))+
  scale_y_continuous(label = scales::percent_format(accuracy = 1)) +
  theme(axis.text.x = element_text(angle = 90))+
  labs(y = "proportion", x = "")
#> # A tibble: 750 x 6
#> # Groups:   month.year [75]
#>    month.year exposure     n   ttl pct_ttl date      
#>    <fct>      <fct>    <int> <int>   <dbl> <date>    
#>  1 Jan 2014   a           66   793  0.0832 2014-01-01
#>  2 Jan 2014   b           83   793  0.105  2014-01-01
#>  3 Jan 2014   c           66   793  0.0832 2014-01-01
#>  4 Jan 2014   d           93   793  0.117  2014-01-01
#>  5 Jan 2014   e           76   793  0.0958 2014-01-01
#>  6 Jan 2014   f           71   793  0.0895 2014-01-01
#>  7 Jan 2014   g           87   793  0.110  2014-01-01
#>  8 Jan 2014   h           77   793  0.0971 2014-01-01
#>  9 Jan 2014   i           87   793  0.110  2014-01-01
#> 10 Jan 2014   j           87   793  0.110  2014-01-01
#> 11 Feb 2014   a           79   708  0.112  2014-02-01
#> 12 Feb 2014   b           66   708  0.0932 2014-02-01
#> 13 Feb 2014   c           69   708  0.0975 2014-02-01
#> 14 Feb 2014   d           69   708  0.0975 2014-02-01
#> 15 Feb 2014   e           69   708  0.0975 2014-02-01
#> 16 Feb 2014   f           78   708  0.110  2014-02-01
#> 17 Feb 2014   g           71   708  0.100  2014-02-01
#> 18 Feb 2014   h           67   708  0.0946 2014-02-01
#> 19 Feb 2014   i           65   708  0.0918 2014-02-01
#> 20 Feb 2014   j           75   708  0.106  2014-02-01
#> # ... with 730 more rows

Created on 2020-04-24 by the reprex package (v0.3.0)reprex package (v0.3.0) 创建于 2020-04-24

Since this includes the tag, here's a way to prep your data:由于这包括标签,下面是一种准备数据的方法:

DT[,
   {  n = .N 
     .SD[, .(rel_freq = .N / n), by = exposure]},
   by = month.year]

     month.year exposure   rel_freq
         <fctr>   <fctr>      <num>
  1:   Nov 2017        g 0.10840108
  2:   Nov 2017        f 0.10027100
  3:   Nov 2017        d 0.10162602
  4:   Nov 2017        i 0.09485095
  5:   Nov 2017        e 0.11382114
 ---                               
746:   Jul 2018        f 0.10506799
747:   Jul 2018        c 0.10259580
748:   Jul 2018        a 0.10754017
749:   Jul 2018        b 0.10135970
750:   Jul 2018        g 0.11248455

Then you have a few options for the call.然后你有几个选项可以拨打

  1. Make a new variable创建一个新变量
DT_relative = DT[, {n = .N; .SD[, .(rel_freq = .N / n), by = exposure]}, by = month.year]

ggplot(DT_relative, aes(x = month.year, y = rel_freq, color = exposure, group = exposure)) + geom_line()
  1. Include the data transformation in the ggplot callggplot调用中包含数据转换
ggplot(DT[, {n = .N; .SD[, .(rel_freq = .N / n), by = exposure]}, by = month.year],
       aes(x = month.year, y = rel_freq, color = exposure, group = exposure)) + geom_line()
  1. Chain together another [data.table call and use ggplot(.SD)将另一个[data.table调用并使用ggplot(.SD)
DT[, {n = .N; .SD[, .(rel_freq = .N / n), by = exposure]}, by = month.year
   ][, ggplot(.SD, aes(x = month.year, y = rel_freq, color = exposure, group = exposure)) + geom_line()]
  1. Attach magrittr to use the pipe附加magrittr以使用 pipe
library(magrittr)
DT[,
   {  n = .N 
   .SD[, .(rel_freq = .N / n), by = exposure]},
   by = month.year]%>%
  ggplot(., aes(x = month.year, y = rel_freq, color = exposure, group = exposure)) + geom_line()

The other answer includes great formatting for the ggplot() call so please see the other answer for how to make the graph look nice.另一个答案包括ggplot()调用的良好格式,因此请参阅另一个答案以了解如何使图形看起来更漂亮。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM