Understanding dates and plotting a histogram with ggplot2 in R
I'm having trouble understanding why the handling of dates, labels and breaks does not work as I expected in R when making a histogram with ggplot2.
I'm looking for date labels in %Y-%b format.
I've uploaded my data to pastebin to make this reproducible. I've created several columns as I wasn't sure of the best way to do this:
> dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
> head(dates)
YM Date Year Month
1 2008-Apr 2008-04-01 2008 4
2 2009-Apr 2009-04-01 2009 4
3 2009-Apr 2009-04-01 2009 4
4 2009-Apr 2009-04-01 2009 4
5 2009-Apr 2009-04-01 2009 4
6 2009-Apr 2009-04-01 2009 4
Here's what I tried:
library(ggplot2)
library(scales)
dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")
ggplot(dates, aes(x=converted)) + geom_histogram() +
  theme(axis.text.x = element_text(angle=90))
Which yields this graph. I wanted %Y-%b formatting, though, so I hunted around and tried the following, based on this SO:
ggplot(dates, aes(x=converted)) + geom_histogram() +
  scale_x_date(labels=date_format("%Y-%b"),
               breaks = "1 month") +
  theme(axis.text.x = element_text(angle=90))
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
That gives me this graph.
I worked through the example in the ggplot2 documentation at the scale_x_date section, and geom_line() appears to break, label, and center ticks correctly when I use it with my same x-axis data. I don't understand why the histogram is different.
I initially thought gauden's answer helped me solve my problem, but am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.
Assume for both:
library(ggplot2)
library(scales)
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
Based on @edgester's answer below, I was able to do the following:
freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")
ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + theme(axis.text.x = element_text(angle=90))
Here is my attempt based on gauden's answer:
dates$Date <- as.Date(dates$Date)
ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
scale_x_date(labels = date_format("%Y-%b"),
breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + theme(axis.text.x = element_text(angle=90))
Plot based on edgester's approach:
Plot based on gauden's approach:
Note the following:
table(dates$Date) reveals that there are 19 instances of 2009-12-01 and 26 instances of 2010-03-01 in the data
Any thoughts on the differences here?
edgester's method of creating a separate count
As an aside, here are other locations that have information about dates and ggplot2 for passers-by looking for help:
[...] format= option did not work for me.
[...] Date vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over so the letters looked kind of odd.
UPDATE
I have updated the example to demonstrate aligning the labels and setting limits on the plot. I also demonstrate that as.Date does indeed work when used consistently (actually it is probably a better fit for your data than my earlier example).
And here is the (somewhat excessively) commented code:
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.Date(dates$Date)
# convert the Date to its numeric equivalent
# Note that Dates are stored as number of days internally,
# hence it is easy to convert back and forth mentally
dates$num <- as.numeric(dates$Date)
bin <- 60 # used for aggregating the data and aligning the labels
p <- ggplot(dates, aes(num, ..count..))
p <- p + geom_histogram(binwidth = bin, colour="white")
# The numeric data is treated as a date,
# breaks are set to an interval equal to the binwidth,
# and a set of labels is generated and adjusted in order to align with bars
p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
max(dates$num),
bin),
labels = date_format("%Y-%b"),
limits = c(as.Date("2009-01-01"),
as.Date("2011-12-01")))
# from here, format at ease
p <- p + theme_bw() + xlab(NULL) + theme(axis.text.x = element_text(angle=45,
hjust = 1,
vjust = 1))
p
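For readers on a current ggplot2 release, here is a minimal sketch of the same idea that keeps the x variable as a Date instead of converting it to numeric. This is my own adaptation rather than the code that produced the plots in this thread: the simulated `dates` data frame is an assumption standing in for the pastebin file, and I have left out the `limits` so that no bars are dropped at the edges.

```r
library(ggplot2)

# Simulated stand-in for the pastebin data (an assumption): one row per event,
# every date falling on the first day of a month.
set.seed(1)
month_starts <- seq(as.Date("2009-01-01"), as.Date("2011-12-01"), by = "month")
dates <- data.frame(Date = sample(month_starts, 500, replace = TRUE))

bin <- 60  # bin width in days, reused as the spacing between axis breaks

p <- ggplot(dates, aes(x = Date)) +
  geom_histogram(binwidth = bin, colour = "white") +
  # breaks are Dates spaced `bin` days apart; the -20 offset is "to taste"
  scale_x_date(
    breaks = seq(min(dates$Date) - 20, max(dates$Date), by = bin),
    date_labels = "%Y-%b"
  ) +
  theme_bw() +
  xlab(NULL) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

print(p)
```

The break offset still has to be adjusted by eye, exactly as in the commented code above, because fixed-width bins in days do not line up with calendar months.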
I try a solution that does everything in ggplot2, drawing without the aggregation, and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.POSIXct(dates$Date)
p <- ggplot(dates, aes(Date, ..count..)) +
geom_histogram() +
theme_bw() + xlab(NULL) +
scale_x_datetime(breaks = date_breaks("3 months"),
labels = date_format("%Y-%b"),
limits = c(as.POSIXct("2009-01-01"),
as.POSIXct("2011-12-01")) )
p
Of course, the label options on the axis could still do with some tuning, but this rounds off the plotting with a clean, short routine that stays within the plotting package.
I think the key thing is that you need to do the frequency calculation outside of ggplot. Use aggregate() with geom_bar(stat="identity") to get a histogram without the reordered factors. Here is some example code:
require(ggplot2)
# scales goes with ggplot and adds the needed scale* functions
require(scales)
# need the month() function for the extra plot
require(lubridate)
# original data
#df<-read.csv("http://pastebin.com/download.php?i=sDzXKFxJ", header=TRUE)
# simulated data
years=sample(seq(2008,2012),681,replace=TRUE,prob=c(0.0176211453744493,0.302496328928047,0.323054331864905,0.237885462555066,0.118942731277533))
months=sample(seq(1,12),681,replace=TRUE)
my.dates=as.Date(paste(years,months,01,sep="-"))
df=data.frame(YM=strftime(my.dates, format="%Y-%b"),Date=my.dates,Year=years,Month=months)
# end simulated data creation
# sort the list just to make it pretty. It makes no difference in the final results
df=df[do.call(order, df[c("Date")]), ]
# add a dummy column for clarity in processing
df$Count=1
# compute the frequencies ourselves
freqs=aggregate(Count ~ Year + Month, data=df, FUN=length)
# rebuild the Date column so that ggplot works
freqs$Date=as.Date(paste(freqs$Year,freqs$Month,"01",sep="-"))
# I set the breaks for 2 months to reduce clutter
g<-ggplot(data=freqs,aes(x=Date,y=Count))+ geom_bar(stat="identity") + scale_x_date(labels=date_format("%Y-%b"),breaks="2 months") + theme_bw() + theme(axis.text.x = element_text(angle=90))
print(g)
# don't overwrite the previous graph
dev.new()
# just for grins, here is a faceted view by year
# Add the Month.name factor to have things work. month() keeps the factor levels in order
freqs$Month.name=month(freqs$Date,label=TRUE, abbr=TRUE)
g2<-ggplot(data=freqs,aes(x=Month.name,y=Count))+ geom_bar(stat="identity") + facet_grid(Year~.) + theme_bw()
print(g2)
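To make the "count outside of ggplot" step concrete, here is a tiny base R sketch of what aggregate() produces. The six dates are made-up toy data (an assumption, not the pastebin file), chosen so the result is easy to check by eye.

```r
# Toy data (an assumption): one row per event
df <- data.frame(Date = as.Date(c("2009-12-01", "2009-12-01", "2010-01-01",
                                  "2010-03-01", "2010-03-01", "2010-03-01")))

# dummy column so there is something to add up
df$Count <- 1

# one row per distinct date; Count is the number of events on that date
freqs <- aggregate(Count ~ Date, data = df, FUN = sum)
print(freqs)
```

Each row of freqs becomes exactly one bar when passed to geom_bar(stat="identity"), so no binning decisions are left to the plotting code. Note that a month with no events (2010-02 here) gets no row at all rather than a zero.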
I know this is an old question, but for anybody coming to this in 2021 (or later), this can be done much more easily using the breaks= argument of geom_histogram() and creating a little shortcut function to make the required sequence.
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- lubridate::ymd(dates$Date)
by_month <- function(x,n=1){
seq(min(x,na.rm=T),max(x,na.rm=T),by=paste0(n," months"))
}
ggplot(dates,aes(Date)) +
geom_histogram(breaks = by_month(dates$Date)) +
scale_x_date(labels = scales::date_format("%Y-%b"),
breaks = by_month(dates$Date,2)) +
theme(axis.text.x = element_text(angle=90))
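One thing worth checking with this approach is what happens to values that sit exactly on a bin edge, which is every value when all dates are the first of a month. The base R sketch below uses made-up toy dates (an assumption) and cut() to mimic the two ways of closing the bins. As far as I know, the closed argument of geom_histogram() defaults to "right", which corresponds to the first variant, but please check that against your ggplot2 version.

```r
# by_month() as defined in the answer above
by_month <- function(x, n = 1) {
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), by = paste0(n, " months"))
}

# Toy data (an assumption): all values fall on the first of a month
d <- as.Date(c("2009-11-01", "2009-12-01", "2009-12-01",
               "2010-01-01", "2010-03-01"))
edges <- by_month(d)  # five month starts, 2009-11-01 to 2010-03-01

# Right-closed bins (edge_i, edge_i+1], lowest edge included:
# a value sitting on an edge is counted in the bin to its left
right_closed <- as.vector(table(
  cut(as.numeric(d), as.numeric(edges), right = TRUE, include.lowest = TRUE)
))

# Left-closed bins [edge_i, edge_i+1) with one extra edge appended,
# so that the last month gets a bin of its own
edges_ext <- seq(min(d), by = "1 month", length.out = length(edges) + 1)
left_closed <- as.vector(table(
  cut(as.numeric(d), as.numeric(edges_ext), right = FALSE)
))

print(right_closed)
print(left_closed)
```

In the right-closed variant the November and December values end up in the same first bin, while the left-closed variant with the extra edge gives one bin per calendar month. If the bars look shifted by a month, trying closed = "left" together with an extended break sequence may be worth a look.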
The erroneous graph under the title "Plot based on gauden's approach" is due to the binwidth parameter: ... + geom_histogram(binwidth = 30, colour = "white") + ... If we change the value from 30 to a value less than 20, such as 10, you will get all frequencies.
In statistics the values are more important than the presentation: a bland graphic is worth more than a very pretty picture with errors.
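A small base R sketch of why a fixed 30-day bin width and first-of-month dates do not get along. Months are 28 to 31 days long, so 30-day bins drift against the calendar. The bin origin used here (the first date) is an assumption; ggplot2 picks its own origin, so the months affected in a real plot can differ from the ones found here.

```r
# First day of every month in the plotted range
month_starts <- seq(as.Date("2009-01-01"), as.Date("2011-12-01"), by = "month")

# Index of the 30-day bin each month start falls into,
# counting bins from the first date (the origin is an assumption)
bin30 <- as.numeric(month_starts - month_starts[1]) %/% 30

# Month starts that landed in a bin already used by an earlier month
shared <- month_starts[duplicated(bin30)]

# 30-day bins inside the range that contain no month start at all
empty <- setdiff(seq(min(bin30), max(bin30)), bin30)

print(shared)
print(empty)
```

With this origin, 2009-02-01 and 2009-03-01 share a bin (they are only 28 days apart) and the following bin is empty, which is the kind of merged bar and gap seen in the plot. Bins that follow calendar months, or counts computed beforehand, avoid the problem.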