Understanding dates and plotting a histogram with ggplot2 in R

Main Question

I'm having trouble understanding why dates, labels, and breaks are not handled the way I expected when I try to make a histogram with ggplot2 in R.

I'm looking for:

  • A histogram of the frequency of my dates
  • Tick marks centered under the matching bars
  • Date labels in %Y-%b format
  • Appropriate limits; minimized empty space between the edge of the grid space and the outermost bars

I've uploaded my data to pastebin to make this reproducible. I created several columns, as I wasn't sure of the best way to do this:

> dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
> head(dates)
       YM       Date Year Month
1 2008-Apr 2008-04-01 2008     4
2 2009-Apr 2009-04-01 2009     4
3 2009-Apr 2009-04-01 2009     4
4 2009-Apr 2009-04-01 2009     4
5 2009-Apr 2009-04-01 2009     4
6 2009-Apr 2009-04-01 2009     4

Here's what I tried:

library(ggplot2)
library(scales)
dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")

ggplot(dates, aes(x=converted)) + geom_histogram() +
       opts(axis.text.x = theme_text(angle=90))

Which yields this graph. I wanted %Y-%b formatting, though, so I hunted around and tried the following, based on this SO question:

ggplot(dates, aes(x=converted)) + geom_histogram() +
       scale_x_date(labels=date_format("%Y-%b"),
                    breaks = "1 month") +
       opts(axis.text.x = theme_text(angle=90))

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

That gives me this graph:

  • Correct x axis label format
  • The frequency distribution has changed shape (binwidth issue?)
  • Tick marks don't appear centered under bars
  • The xlims have changed as well

I worked through the example in the ggplot2 documentation at the scale_x_date section, and geom_line() appears to break, label, and center ticks correctly when I use it with my same x-axis data. I don't understand why the histogram is different.


Updates based on answers from edgester and gauden

I initially thought gauden's answer helped me solve my problem, but I am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.

Assume for both:

library(ggplot2)
library(scales)
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)

Based on @edgester's answer below, I was able to do the following:

freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")

ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
       scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
                    limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
       ylab("Frequency") + xlab("Year and Month") +
       theme_bw() + opts(axis.text.x = theme_text(angle=90))

Here is my attempt based on gauden's answer:

dates$Date <- as.Date(dates$Date)
ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
       scale_x_date(labels = date_format("%Y-%b"),
                    breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
                    limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
       ylab("Frequency") + xlab("Year and Month") +
       theme_bw() + opts(axis.text.x = theme_text(angle=90))

Plot based on edgester's approach:

[Image: plot from edgester's approach]

Plot based on gauden's approach:

[Image: plot from gauden's approach]

Note the following:

  • There are gaps in gauden's plot for 2009-Dec and 2010-Mar; table(dates$Date) reveals that there are 19 instances of 2009-12-01 and 26 instances of 2010-03-01 in the data.
  • edgester's plot starts at 2008-Apr and ends at 2012-May. This is correct based on a minimum date of 2008-04-01 and a maximum date of 2012-05-01 in the data. For some reason gauden's plot starts in 2008-Mar and still somehow manages to end at 2012-May. After counting bins and reading along the month labels, for the life of me I can't figure out which plot has an extra bin or is missing one!

Any thoughts on the differences here? edgester's method of creating a separate count


Related References

As an aside, here are other locations that have information about dates and ggplot2, for passers-by looking for help:

  • Started here at learnr.wordpress, a popular R blog. It stated that I needed to get my data into POSIXct format, which I now think is false and wasted my time.
  • Another learnr post recreates a time series in ggplot2, but wasn't really applicable to my situation.
  • r-bloggers has a post on this, but it appears outdated. The simple format= option did not work for me.
  • This SO question is playing with breaks and labels. I tried treating my Date vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over, so the letters looked kind of odd. The distribution is sort of correct, but there are odd breaks. My attempt based on the accepted answer was like so (result here).

UPDATE

Version 2: Using Date class

I have updated the example to demonstrate aligning the labels and setting limits on the plot. I also demonstrate that as.Date does indeed work when used consistently (in fact, it is probably a better fit for your data than my earlier example).

The Target Plot v2

[Image: date-based histogram]

The Code v2

And here is the (somewhat excessively) commented code:

library("ggplot2")
library("scales")

dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.Date(dates$Date)

# convert the Date to its numeric equivalent
# Note that Dates are stored as number of days internally,
# hence it is easy to convert back and forth mentally
dates$num <- as.numeric(dates$Date)

bin <- 60 # used for aggregating the data and aligning the labels

p <- ggplot(dates, aes(num, ..count..))
p <- p + geom_histogram(binwidth = bin, colour="white")

# The numeric data is treated as a date,
# breaks are set to an interval equal to the binwidth,
# and a set of labels is generated and adjusted in order to align with bars
p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
                                   max(dates$num), 
                                   bin),
                      labels = date_format("%Y-%b"),
                      limits = c(as.Date("2009-01-01"), 
                                 as.Date("2011-12-01")))

# from here, format at ease
p <- p + theme_bw() + xlab(NULL) + opts(axis.text.x  = theme_text(angle=45,
                                                                  hjust = 1,
                                                                  vjust = 1))
p
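
As a side note on the comment about Dates being stored as a number of days: the following is only a minimal sketch in base R, using an arbitrary example date, of the round trip between Date and numeric. It also shows that a fixed step in days (such as the 60-day bin above) does not follow calendar months, which is one plausible reason labels drift away from month starts.

```r
# Minimal sketch (example date chosen arbitrarily): Dates are day counts
# since 1970-01-01, so converting back and forth is lossless.
d <- as.Date("2009-12-01")
n <- as.numeric(d)                           # days since the epoch
back <- as.Date(n, origin = "1970-01-01")    # round trip back to a Date

# A fixed 60-day step versus a calendar step of two months
steps_60d <- seq(d, by = 60, length.out = 4)
steps_2mo <- seq(d, by = "2 months", length.out = 4)

print(format(steps_60d, "%Y-%m-%d"))
print(format(steps_2mo, "%Y-%m-%d"))
```

The two printed sequences start on the same day but separate afterwards, because months are not all 30 days long.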

Version 1: Using POSIXct

I try a solution that does everything in ggplot2, drawing without the aggregation, and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.

The Target Plot v1

[Image: plot with limits set in ggplot2]

The Code v1

library("ggplot2")
library("scales")

dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.POSIXct(dates$Date)

p <- ggplot(dates, aes(Date, ..count..)) + 
    geom_histogram() +
    theme_bw() + xlab(NULL) +
    scale_x_datetime(breaks = date_breaks("3 months"),
                     labels = date_format("%Y-%b"),
                     limits = c(as.POSIXct("2009-01-01"), 
                                as.POSIXct("2011-12-01")) )

p

Of course, the axis label options could use some tweaking, but the aim here is to do all of the plotting with a clean, short routine inside the plotting package itself.

I think the key thing is that you need to do the frequency calculation outside of ggplot. Use aggregate() with geom_bar(stat="identity") to get a histogram without the reordered factors. Here is some example code:

require(ggplot2)

# scales goes with ggplot and adds the needed scale* functions
require(scales)

# need the month() function for the extra plot
require(lubridate)

# original data
#df<-read.csv("http://pastebin.com/download.php?i=sDzXKFxJ", header=TRUE)

# simulated data
years=sample(seq(2008,2012),681,replace=TRUE,prob=c(0.0176211453744493,0.302496328928047,0.323054331864905,0.237885462555066,0.118942731277533))
months=sample(seq(1,12),681,replace=TRUE)
my.dates=as.Date(paste(years,months,01,sep="-"))
df=data.frame(YM=strftime(my.dates, format="%Y-%b"),Date=my.dates,Year=years,Month=months)
# end simulated data creation

# sort the list just to make it pretty. It makes no difference in the final results
df=df[do.call(order, df[c("Date")]), ]

# add a dummy column for clarity in processing
df$Count=1

# compute the frequencies ourselves
freqs=aggregate(Count ~ Year + Month, data=df, FUN=length)

# rebuild the Date column so that ggplot works
freqs$Date=as.Date(paste(freqs$Year,freqs$Month,"01",sep="-"))

# I set the breaks for 2 months to reduce clutter
g<-ggplot(data=freqs,aes(x=Date,y=Count))+ geom_bar(stat="identity") + scale_x_date(labels=date_format("%Y-%b"),breaks="2 months") + theme_bw() + opts(axis.text.x = theme_text(angle=90))
print(g)

# don't overwrite the previous graph
dev.new()

# just for grins, here is a faceted view by year
# Add the Month.name factor to have things work. month() keeps the factor levels in order
freqs$Month.name=month(freqs$Date,label=TRUE, abbr=TRUE)
g2<-ggplot(data=freqs,aes(x=Month.name,y=Count))+ geom_bar(stat="identity") + facet_grid(Year~.) + theme_bw()
print(g2)
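
For readers on a current ggplot2 release, here is a minimal sketch of the same "count outside ggplot, then draw bars" idea. It is not the code from this answer: the toy data frame is made up for illustration, table() stands in for aggregate(), and theme()/element_text(), geom_col(), date_breaks and date_labels are used as the present-day counterparts of opts()/theme_text(), geom_bar(stat="identity"), breaks and date_format().

```r
library(ggplot2)

# Hypothetical toy data: one row per observation, dated to the first of a month
toy <- data.frame(Date = as.Date(c("2009-04-01", "2009-04-01", "2009-05-01",
                                   "2009-07-01", "2009-07-01", "2009-07-01")))

# Count rows per distinct date outside ggplot2
freqs <- as.data.frame(table(Date = toy$Date), responseName = "Count")
# table() turns the dates into factor labels, so rebuild the Date column
freqs$Date <- as.Date(as.character(freqs$Date))

# Draw the precomputed counts as bars on a date axis
p <- ggplot(freqs, aes(x = Date, y = Count)) +
  geom_col() +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%b") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))
print(p)
```

Because each bar is positioned at its own date, the tick for a month sits under that month's bar, and months with no rows simply have no bar.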

I know this is an old question, but for anybody coming to this in 2021 (or later), this can be done much more easily by using the breaks= argument of geom_histogram() and creating a little shortcut function to generate the required sequence.

library(ggplot2)

dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)

dates$Date <- lubridate::ymd(dates$Date)

by_month <- function(x,n=1){
  seq(min(x,na.rm=T),max(x,na.rm=T),by=paste0(n," months"))
}

ggplot(dates,aes(Date)) +
  geom_histogram(breaks = by_month(dates$Date)) +
  scale_x_date(labels = scales::date_format("%Y-%b"),
               breaks = by_month(dates$Date,2)) + 
  theme(axis.text.x = element_text(angle=90))

[Image: histogram]
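
To see what by_month() actually hands to breaks=, here is a small base R sketch on made-up dates. The extended edge vector (edges_ext) and the left-closed counting are assumptions added for illustration, not part of the answer above: they show one way to give the last month a bin of its own when every value falls on the first of a month.

```r
# Same helper as above, repeated so the sketch is self-contained
by_month <- function(x, n = 1) {
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), by = paste0(n, " months"))
}

# Hypothetical toy dates, all on the first of a month
x <- as.Date(c("2008-04-01", "2008-06-01", "2008-06-01", "2008-08-01"))

edges <- by_month(x)      # monthly bin edges from min(x) to max(x)
ticks <- by_month(x, 2)   # axis breaks every second month

# The edges stop at max(x), so the final month has no bin that starts on it.
# Assumption: add one more monthly edge and count with left-closed
# intervals [start, next start), so each month start opens its own bin.
edges_ext <- seq(min(x), by = "1 month", length.out = length(edges) + 1)
idx <- findInterval(as.numeric(x), as.numeric(edges_ext))
counts <- tabulate(idx, nbins = length(edges_ext) - 1)
names(counts) <- format(head(edges_ext, -1), "%Y-%m")
print(counts)
```

If the same idea is wanted inside ggplot2, passing edges_ext as breaks together with closed = "left" to geom_histogram() is the presumable equivalent, but that is an assumption and has not been checked against the pastebin data.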

The erroneous graph under the title "Plot based on gauden's approach" is caused by the binwidth parameter: ... + geom_histogram(binwidth = 30, colour = "white") + ... If we change the value from 30 to something less than 20, such as 10, you will get all frequencies.

In statistics, the values matter more than the presentation: a bland but correct graphic is better than a very pretty picture with errors.
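
The effect of a fixed 30-day binwidth can be checked without drawing anything. The sketch below is only an illustration under stated assumptions: it uses one date per month over the range reported in the question (2008-04-01 to 2012-05-01), and origin_shift is a hypothetical parameter standing in for wherever the plotting code happens to place the first bin edge, which this sketch does not try to reproduce.

```r
# One first-of-month date per month across the range reported in the question
months <- seq(as.Date("2008-04-01"), as.Date("2012-05-01"), by = "1 month")

# Count how many month starts fall into each fixed-width bin.
# origin_shift (hypothetical) moves the first bin edge earlier by that many
# days; keep it smaller than width so the first bin is never empty.
months_per_bin <- function(dates, width, origin_shift = 0) {
  offset <- as.numeric(dates - min(dates)) + origin_shift
  tabulate(offset %/% width + 1)
}

for (shift in c(0, 10, 20)) {
  n <- months_per_bin(months, width = 30, origin_shift = shift)
  cat("shift", shift, ": empty bins =", sum(n == 0),
      ", bins holding two months =", sum(n == 2), "\n")
}
```

Whether a 30-day bin ends up empty or swallows two months depends on where the bin edges start, which is why calendar-based breaks are a safer choice for monthly data than any fixed number of days.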

