
Show missing values in multiple time-series on one plot

I have five different time series covering three consecutive years. I want to show the missing values in these series as gaps on a plot. My idea was to create another data frame corresponding to these series in which every observed value is replaced by 1 and every NA is left as is. Such a dummy data frame looks like this:

# create sample time index
timeindex <- seq(as.POSIXct("2014-01-01"),as.POSIXct("2016-12-31"),by="1 mins")
# create 5 sample series of same length as of time index
sequence_1 <- sample(seq(from = 0, to = 1, by = 1), size =  length(timeindex), replace = TRUE)
sequence_2 <- sample(seq(from = 0, to = 1, by = 1), size =  length(timeindex), replace = TRUE)
sequence_3 <- sample(seq(from = 0, to = 1, by = 1), size =  length(timeindex), replace = TRUE)
sequence_4 <- sample(seq(from = 0, to = 1, by = 1), size =  length(timeindex), replace = TRUE)
sequence_5 <- sample(seq(from = 0, to = 1, by = 1), size =  length(timeindex), replace = TRUE)
# create data frame of sequences
df <- data.frame(sequence_1,sequence_2,sequence_3,sequence_4,sequence_5)
df <- ifelse(df==0,NA,1) # replace 0 with NA to show missing data values
df_with_time <- data.frame(timeindex,df) # attach timestamp to sequences

Now the question is how to show the missing values (gaps) in one graph. I melted my data frame and tried geom_line() with facet_grid(), but my computer hangs indefinitely. The code is:

library(ggplot2)
df_melt <- reshape2::melt(df_with_time,id.vars="timeindex") # melt for ggplot
ggplot(df_melt,aes(timeindex,value,variable)) +  geom_line() + facet_grid(variable~.)
#ggplot(df_melt,aes(timeindex,value,variable)) +  geom_area() + facet_grid(variable~.)

Now I have two questions:

  1. ggplot fails to plot this much data on a machine with 8 GB RAM and a 2.6 GHz processor. Is there any other way to plot such large data?
  2. Is there any other way to show gaps (missing values) in the data?

UPDATE: I want a plot something like this: [image]

The missing data points are shown as gaps.

If aggregating NAs across series is a no-go, I would suggest binning your data by time. Briefly, you can count how many NAs fall in each 30-minute or 60-minute window and plot the counts using ggplot. I show an example below.

# Binning
head(df_with_time)
time.gap <- 60 # bin by hour
idx <- seq(1, nrow(df_with_time), by = time.gap) 
na.counts <- lapply(idx[-length(idx)], (function(i){
  tmp <- df_with_time[i:(i+(time.gap-1)),]
  counts <- apply(tmp[,-1], 2, (function(y){ sum(is.na(y)) }))
  counts
}))
na.counts <- data.frame(time=df_with_time[idx[-length(idx)],]$timeindex, 
                        do.call(rbind, na.counts), 
                        stringsAsFactors = FALSE,
                        row.names = NULL)
head(na.counts)

# Convert to suitable df and then plot (color tracks with NA count)
df_melt <- reshape2::melt(na.counts,id.vars="time") # melt for ggplot
df_melt$y <- as.integer(as.factor(df_melt$variable))
df_melt <- df_melt[order(df_melt$value - median(df_melt$value)), ]

ggplot(df_melt,aes(x=time, y=y)) +  
  geom_point(aes(colour = value), shape = 124, alpha = 0.75, size = 2.5) + 
  scale_colour_gradient2(low = "#01665e", mid = "#f5f5f5", high = "#8c510a", midpoint = median(df_melt$value))

This is the result. [image]
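The hourly binning step above can also be written without an explicit loop over row indices. The following is only a sketch under stated assumptions: `toy` is a small hypothetical stand-in for df_with_time (three hours, two series, hand-placed NAs), and the bins are derived from the timestamps with trunc() rather than from row positions, so it does not depend on every bin holding exactly 60 rows.

```r
# Hypothetical stand-in for df_with_time: 3 hours of 1-minute data
timeindex <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
                 by = "1 min", length.out = 180)
sequence_1 <- rep(1, 180)
sequence_1[c(5:14, 100:129)] <- NA   # gaps placed by hand for illustration
sequence_2 <- rep(1, 180)
sequence_2[170:180] <- NA
toy <- data.frame(timeindex, sequence_1, sequence_2)

# Start of the hour each row falls in (the bin label)
bin <- as.POSIXct(trunc(toy$timeindex, units = "hours"))

# Count NAs per bin and per series in one call:
# is.na() gives a logical matrix, and sum() over a bin counts the TRUEs
na_counts <- aggregate(is.na(toy[, -1]), by = list(time = bin), FUN = sum)
na_counts
```

The result has one row per hour and one count column per series, which is the same shape as na.counts above, so it can be melted and plotted the same way. Changing the `units` argument of trunc() (for example to "days") changes the bin width.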

Alternatively, you may want to drop values that are too close to the median and plot only your 'outliers'. Since this removes a lot of data, the plot is generated quickly.

df_melt2 <- df_melt[abs(df_melt$value - median(df_melt$value)) > 8, ]

ggplot(df_melt2,aes(x=time, y=y)) +  
  geom_point(aes(colour = value), shape = 124, alpha = 0.75, size = 4.5) + 
  scale_colour_gradient2(low = "#01665e", mid = "#f5f5f5", high = "#8c510a", midpoint = median(df_melt$value))

[image]

PS: I assumed you are interested in values that are far from the median. If you care about total NA counts, use scale_colour_gradient() instead.
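As a rough illustration of that last point, the sketch below colours each tick by its raw NA count on a two-colour scale instead of by distance from the median. The `counts` data frame and its values are made up for the example and only mimic the layout of df_melt; the colours are borrowed from the plots above.

```r
library(ggplot2)

# Hypothetical stand-in for df_melt: hourly NA counts for two series
counts <- data.frame(
  time = rep(seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
                 by = "1 hour", length.out = 4), times = 2),
  variable = rep(c("sequence_1", "sequence_2"), each = 4),
  value = c(28, 35, 30, 41, 22, 30, 33, 27)   # illustrative counts only
)
counts$y <- as.integer(as.factor(counts$variable))

# Sequential scale: light = few NAs in the bin, dark = many NAs
p <- ggplot(counts, aes(x = time, y = y)) +
  geom_point(aes(colour = value), shape = 124, size = 4.5) +
  scale_colour_gradient(low = "#f5f5f5", high = "#8c510a")
```

Printing `p` draws the plot. Unlike scale_colour_gradient2(), this scale has no midpoint, so the darkest ticks are simply the bins with the most missing values.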

If you reshape it into long format:

data <- reshape2::melt(df_with_time,
                       id.vars="timeindex", 
                       variable.name = 'Sequence', 
                       value.name = 'Data')

you can plot it in ggplot the way you want:

ggplot(data, 
       aes(x = timeindex, 
           y = Sequence, 
           size = Data)) + 
geom_line()

This is for a single month, to keep things smaller: [image]
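For the full three years, one way to keep the plot light is to avoid drawing every minute and instead draw one segment per uninterrupted stretch of data. This is not from the answers above; it is a sketch that assumes a regular sampling interval. The helper name `present_runs`, its `step` argument, and the `toy` data frame are all hypothetical. It uses rle() to find runs of non-missing values and geom_segment() to draw them, so the gaps between segments are the missing data.

```r
library(ggplot2)

# Collapse one series into runs of consecutive non-missing values.
# `step` is the sampling interval in seconds; each run is extended by one
# step so that a single isolated sample still has visible width.
present_runs <- function(time, x, step = 60) {
  r <- rle(!is.na(x))
  end_idx <- cumsum(r$lengths)
  start_idx <- end_idx - r$lengths + 1
  keep <- r$values                     # TRUE for runs where data is present
  data.frame(start = time[start_idx[keep]],
             end = time[end_idx[keep]] + step)
}

# Hypothetical stand-in for df_with_time: 10 minutes, two series
timeindex <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
                 by = "1 min", length.out = 10)
toy <- data.frame(
  timeindex,
  sequence_1 = c(1, 1, NA, NA, 1, 1, 1, NA, 1, 1),
  sequence_2 = c(NA, 1, 1, 1, 1, NA, NA, NA, NA, 1)
)

# One row per run, labelled with the series it belongs to
series <- setdiff(names(toy), "timeindex")
runs <- do.call(rbind, lapply(series, function(s) {
  runs_s <- present_runs(toy$timeindex, toy[[s]])
  data.frame(Sequence = rep(s, nrow(runs_s)), runs_s)
}))

# Horizontal bars where data exists; blank stretches are the gaps
# (linewidth assumes ggplot2 >= 3.4.0; older versions use size)
p <- ggplot(runs, aes(x = start, xend = end, y = Sequence, yend = Sequence)) +
  geom_segment(linewidth = 4)
```

The number of rows in `runs` grows with the number of gaps rather than with the number of observations, so ggplot has far less to draw when gaps are long. With the random 0/1 sample data from the question, where gaps change almost every minute, the saving would be small, and binning as in the other answer is likely the better fit.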


 