Show missing values in multiple time-series on one plot
I have five different time series spanning three consecutive years. I want to show the missing values in these series as gaps on a plot. My idea is to create another data frame corresponding to these series in which every observed value is replaced with 1 and every NA is left as is. Such a dummy data frame looks like this:
# create sample time index
timeindex <- seq(as.POSIXct("2014-01-01"),as.POSIXct("2016-12-31"),by="1 mins")
# create 5 sample series of same length as of time index
sequence_1 <- sample(seq(from = 0, to = 1, by = 1), size = length(timeindex), replace = TRUE)
sequence_2 <- sample(seq(from = 0, to = 1, by = 1), size = length(timeindex), replace = TRUE)
sequence_3 <- sample(seq(from = 0, to = 1, by = 1), size = length(timeindex), replace = TRUE)
sequence_4 <- sample(seq(from = 0, to = 1, by = 1), size = length(timeindex), replace = TRUE)
sequence_5 <- sample(seq(from = 0, to = 1, by = 1), size = length(timeindex), replace = TRUE)
# create data frame of sequences
df <- data.frame(sequence_1,sequence_2,sequence_3,sequence_4,sequence_5)
df <- ifelse(df==0,NA,1) # replace 0 with NA to show missing data values
df_with_time <- data.frame(timeindex,df) # attach timestamp to sequences
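For real data that already contains NA values, the same kind of indicator frame can be derived directly with is.na() instead of sampling zeros and ones. A minimal sketch, assuming a hypothetical data frame `raw` of measurements (it is not part of the question):

```r
# Hypothetical raw measurements with some missing values (illustration only)
raw <- data.frame(a = c(2.5, NA, 3.1, NA),
                  b = c(NA, 1.0, 1.2, 1.4))

# 1 where a value is present, NA where it is missing
indicator <- as.data.frame(lapply(raw, function(x) ifelse(is.na(x), NA, 1)))
indicator
```

The timestamp column can then be attached with data.frame(), as in the dummy example above.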
Now the question is how to show the missing values (gaps) in one graph. I melted my data frame and thought of using geom_line() with facet_grid(), but my computer seems to hang indefinitely. The code is:
library(ggplot2)
df_melt <- reshape2::melt(df_with_time,id.vars="timeindex") # melt for ggplot
ggplot(df_melt,aes(timeindex,value,variable)) + geom_line() + facet_grid(variable~.)
#ggplot(df_melt,aes(timeindex,value,variable)) + geom_area() + facet_grid(variable~.)
Now I have two questions:
UPDATE: I want a plot something like this:
The missing data points are shown as gaps.
If aggregating NAs across series is not an option, I would suggest time-based binning of your data. Briefly, you can count how many NAs fall in each 30-min or 60-min window and plot the counts using ggplot. I show an example below.
# Binning
head(df_with_time)
time.gap <- 60 # bin by hour
idx <- seq(1, nrow(df_with_time), by = time.gap)
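# idx[-length(idx)] drops the last window start, so every window has time.gap rows
na.counts <- lapply(idx[-length(idx)], function(i) {
  tmp <- df_with_time[i:(i + (time.gap - 1)), ]
  # number of NAs per series within this window (column 1 is the timestamp)
  counts <- apply(tmp[, -1], 2, function(y) sum(is.na(y)))
  counts
})
na.counts <- data.frame(time = df_with_time[idx[-length(idx)], ]$timeindex,
                        do.call(rbind, na.counts),
                        stringsAsFactors = FALSE,
                        row.names = NULL)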
head(na.counts)
# Convert to suitable df and then plot (color tracks with NA count)
df_melt <- reshape2::melt(na.counts,id.vars="time") # melt for ggplot
df_melt$y <- as.integer(as.factor(df_melt$variable))
df_melt <- df_melt[order(df_melt$value - median(df_melt$value)), ]
ggplot(df_melt,aes(x=time, y=y)) +
geom_point(aes(colour = value), shape = 124, alpha = 0.75, size = 2.5) +
scale_colour_gradient2(low = "#01665e", mid = "#f5f5f5", high = "#8c510a", midpoint = median(df_melt$value))
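The hourly binning step can also be written without an explicit index loop. The following is a minimal sketch of that alternative, not the code used in this answer: it uses trunc() to floor each timestamp to its hour and aggregate() to count NAs per hour. The `small` data frame is a made-up stand-in for df_with_time, and the UTC time zone is an assumption.

```r
# Made-up stand-in for df_with_time: 3 hours of minute data, two series
set.seed(1)
timeindex <- seq(as.POSIXct("2014-01-01 00:00", tz = "UTC"),
                 by = "1 min", length.out = 180)
small <- data.frame(timeindex,
                    sequence_1 = sample(c(NA, 1), 180, replace = TRUE),
                    sequence_2 = sample(c(NA, 1), 180, replace = TRUE))

# Floor every timestamp to the start of its hour
small$bin <- as.POSIXct(trunc(small$timeindex, units = "hours"))

# Count NAs per hour and per series
na.counts <- aggregate(small[c("sequence_1", "sequence_2")],
                       by = list(time = small$bin),
                       FUN = function(y) sum(is.na(y)))
na.counts
```

Because the bins are derived from the timestamps rather than from row positions, this variant should also cope with rows that are missing from the index, at the cost of bins that may contain fewer than 60 rows.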
You may also want to drop values that are close to the median and plot only the 'outliers'. Since this removes a lot of data, the plot is generated quickly.
df_melt2 <- df_melt[abs(df_melt$value - median(df_melt$value)) > 8, ]
ggplot(df_melt2,aes(x=time, y=y)) +
geom_point(aes(colour = value), shape = 124, alpha = 0.75, size = 4.5) +
scale_colour_gradient2(low = "#01665e", mid = "#f5f5f5", high = "#8c510a", midpoint = median(df_melt$value))
PS: I assumed you are interested in values that are far from the median. If you care about total NA counts, use scale_colour_gradient() instead.
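A minimal sketch of the scale_colour_gradient() variant mentioned in the PS. The `counts` data frame is a tiny made-up stand-in for the melted table of NA counts, and the two colours are only a suggestion:

```r
library(ggplot2)

# Made-up stand-in for the melted NA counts (values are invented)
counts <- data.frame(
  time  = rep(seq(as.POSIXct("2014-01-01 00:00", tz = "UTC"),
                  by = "1 hour", length.out = 4), times = 2),
  y     = rep(1:2, each = 4),
  value = c(28, 35, 30, 41, 25, 33, 29, 38)
)

# Sequential two-colour scale: browner ticks mean more NAs in that hour
p <- ggplot(counts, aes(x = time, y = y)) +
  geom_point(aes(colour = value), shape = 124, size = 4.5) +
  scale_colour_gradient(low = "#f5f5f5", high = "#8c510a")
print(p)
```

Unlike scale_colour_gradient2(), this scale has no midpoint, so colour tracks the raw count rather than the distance from the median.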
If you reshape it into long format:
data <- reshape2::melt(df_with_time,
id.vars="timeindex",
variable.name = 'Sequence',
value.name = 'Data')
you can plot it in ggplot the way you want:
ggplot(data,
aes(x = timeindex,
y = Sequence,
size = Data)) +
geom_line()
This is for a single month, to keep things smaller:
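Restricting the long-format data to one month before plotting could look like the sketch below. The choice of January 2014, the UTC time zone, and the small hourly stand-in for `data` are all assumptions made for illustration; the answer does not say which month was used.

```r
library(ggplot2)

# Made-up stand-in for the long-format `data`: hourly steps, two months, two series
set.seed(1)
timeindex <- seq(as.POSIXct("2014-01-01 00:00", tz = "UTC"),
                 as.POSIXct("2014-02-28 23:00", tz = "UTC"), by = "1 hour")
data <- data.frame(
  timeindex = rep(timeindex, times = 2),
  Sequence  = rep(c("sequence_1", "sequence_2"), each = length(timeindex)),
  Data      = sample(c(NA, 1), 2 * length(timeindex), replace = TRUE)
)

# Keep January 2014 only (the month is an assumption)
one_month <- subset(data,
                    timeindex >= as.POSIXct("2014-01-01", tz = "UTC") &
                    timeindex <  as.POSIXct("2014-02-01", tz = "UTC"))

# Same plot call as above, on the reduced data; print(p) renders it
p <- ggplot(one_month, aes(x = timeindex, y = Sequence, size = Data)) +
  geom_line()
```

Recent ggplot2 releases may emit a deprecation warning for `size` on lines and suggest `linewidth` instead.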