
pandas plot time-series with minimized gaps

I recently started to explore the depths of pandas and would like to visualize some time-series data which contains gaps, some of them rather large. An example mydf:

             timestamp       val
0  2016-07-25 00:00:00  0.740442
1  2016-07-25 01:00:00  0.842911
2  2016-07-25 02:00:00 -0.873992
3  2016-07-25 07:00:00 -0.474993
4  2016-07-25 08:00:00 -0.983963
5  2016-07-25 09:00:00  0.597011
6  2016-07-25 10:00:00 -2.043023
7  2016-07-25 12:00:00  0.304668
8  2016-07-25 13:00:00  1.185997
9  2016-07-25 14:00:00  0.920850
10 2016-07-25 15:00:00  0.201423
11 2016-07-25 16:00:00  0.842970
12 2016-07-25 21:00:00  1.061207
13 2016-07-25 22:00:00  0.232180
14 2016-07-25 23:00:00  0.453964

Now I can plot my DataFrame through df1.plot(x='timestamp').get_figure().show(), and the data along the x-axis gets interpolated, so the gaps appear as one continuous line:

What I would like to have instead is:

  • visible gaps between the sections that contain data
  • a consistent gap width for gaps of differing lengths
  • perhaps some form of marker on the axis which helps to clarify that a jump in time occurs there

Researching this matter I've come across a couple of approaches which generally come close to what I'm after, but the former would simply leave the gaps out of the plotted figure, and the latter would produce the large gaps that I would like to avoid (think of gaps that may even span a few days).

As the second approach may be closer, I tried to use my timestamp column as an index through:

mydf2 = pd.DataFrame(data=list(mydf['val']), index=mydf['timestamp'], columns=['val'])

which allows me to fill the gaps with NaN through reindexing (wondering if there is a simpler way to achieve this; a sketch of one possible alternative follows after the output below):

mydf3 = mydf2.reindex(pd.date_range('25/7/2016', periods=24, freq='H'))

leading to:

                          val
2016-07-25 00:00:00  0.740442
2016-07-25 01:00:00  0.842911
2016-07-25 02:00:00 -0.873992
2016-07-25 03:00:00       NaN
2016-07-25 04:00:00       NaN
2016-07-25 05:00:00       NaN
2016-07-25 06:00:00       NaN
2016-07-25 07:00:00 -0.474993
2016-07-25 08:00:00 -0.983963
2016-07-25 09:00:00  0.597011
2016-07-25 10:00:00 -2.043023
2016-07-25 11:00:00       NaN
2016-07-25 12:00:00  0.304668
2016-07-25 13:00:00  1.185997
2016-07-25 14:00:00  0.920850
2016-07-25 15:00:00  0.201423
2016-07-25 16:00:00  0.842970
2016-07-25 17:00:00       NaN
2016-07-25 18:00:00       NaN
2016-07-25 19:00:00       NaN
2016-07-25 20:00:00       NaN
2016-07-25 21:00:00  1.061207
2016-07-25 22:00:00  0.232180
2016-07-25 23:00:00  0.453964
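
A sketch of one possible simpler route (an assumption on my part, not something I have seen elsewhere): set the timestamp column as the index and let asfreq() insert the missing hours, assuming the column is, or is converted to, a datetime dtype:

mydf['timestamp'] = pd.to_datetime(mydf['timestamp'])  # in case the column still holds strings
# set_index gives a DatetimeIndex; asfreq('H') inserts a NaN row for every missing hour
mydf3_alt = mydf.set_index('timestamp').asfreq('H')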

From here on I might need to reduce runs of consecutive missing entries above a certain length to a fixed number of rows (representing my gap width) and do something to the index values of these entries so they are plotted differently, but I got lost here, I guess, as I don't know how to achieve something like that.

While tinkering around I wondered if there might be a more direct and elegant approach, and I would be thankful if anyone knowing more about this could point me in the right direction.

Thanks in advance for any hints and feedback!

### ADDENDUM ###

After posting my question I've come across another interesting idea posted by Andy Hayden that seems helpful. He uses a column to hold the result of comparing the timestamp differences against a time-delta. After performing a cumsum() on the int representation of the boolean results, he uses groupby() to cluster the entries of each ungapped series into a DataFrameGroupBy object.

As this was written some time ago and pandas now returns timedelta objects, the comparison should be done against another timedelta object, like so (based on the mydf from above, or on the reindexed mydf2 after copying its index into a new column through mydf2['timestamp'] = mydf2.index):

from datetime import timedelta
myTD = timedelta(minutes=60)                      # anything further apart than this counts as a gap
mydf['nogap'] = mydf['timestamp'].diff() > myTD   # True wherever a new section starts after a gap
mydf['nogap'] = mydf['nogap'].apply(lambda x: 1 if x else 0).cumsum()
## btw.: why not "... .apply(lambda x: int(x)) ..."?
dfg = mydf.groupby('nogap')                       # one group per ungapped section

We now could iterate over the DataFrameGroupBy object, get the ungapped series and do something with them. My pandas/matplotlib skills are way too immature, but could we plot the group elements into sub-plots? Maybe that way the discontinuity along the time axis could be represented in some form (an interrupted axis line or such)?
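
A rough sketch of how that might look (an assumption on my part, nothing more than an untested idea building on the dfg grouping above): give each ungapped section its own sub-axis, with widths proportional to the number of points, so the breaks between the axes mark the time jumps.

import matplotlib.pyplot as plt

# sketch (assumption): one sub-axis per ungapped group; assumes there is more than one group
groups = list(dfg)
fig, axes = plt.subplots(1, len(groups), sharey=True, figsize=(10, 3),
                         gridspec_kw={'width_ratios': [len(g) for _, g in groups]})
for ax, (_, g) in zip(axes, groups):
    ax.plot(g['timestamp'], g['val'])             # each section plotted on its own axis
    ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()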

piRSquared's answer already leads to a quite usable result; the only thing kind of missing is a more striking visual feedback along the time axis that a gap/time-jump has occurred between two values.

Maybe with the grouped sections the width of the gap representation could be made more configurable?

I built a new series and plotted it. This is not super elegant! But I believe it gets you what you wanted.

Setup

Do this to get to your starting point:

from io import StringIO   # on Python 2: from StringIO import StringIO
import pandas as pd

text = """          timestamp       val
2016-07-25 00:00:00   0.740442
2016-07-25 01:00:00   0.842911
2016-07-25 02:00:00  -0.873992
2016-07-25 07:00:00  -0.474993
2016-07-25 08:00:00  -0.983963
2016-07-25 09:00:00   0.597011
2016-07-25 10:00:00  -2.043023
2016-07-25 12:00:00   0.304668
2016-07-25 13:00:00   1.185997
2016-07-25 14:00:00   0.920850
2016-07-25 15:00:00   0.201423
2016-07-25 16:00:00   0.842970
2016-07-25 21:00:00   1.061207
2016-07-25 22:00:00   0.232180
2016-07-25 23:00:00   0.453964"""

s1 = pd.read_csv(StringIO(text),
                 index_col=0,               # use the timestamp column as the index
                 parse_dates=[0],           # ... and parse it as datetimes
                 engine='python',
                 sep=r'\s{2,}').squeeze()   # squeeze the single-column frame into a Series

s1

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Name: val, dtype: float64

Resample hourly. resample is a deferred method, meaning it expects you to pass another method afterwards so it knows what to do. I used mean. For your example it doesn't matter, because we are resampling to a higher frequency. Look it up if you care.

s2 = s1.resample('H').mean()

s2

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 03:00:00         NaN
2016-07-25 04:00:00         NaN
2016-07-25 05:00:00         NaN
2016-07-25 06:00:00         NaN
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 11:00:00         NaN
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 17:00:00         NaN
2016-07-25 18:00:00         NaN
2016-07-25 19:00:00         NaN
2016-07-25 20:00:00         NaN
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Freq: H, Name: val, dtype: float64

OK, so you also wanted equally sized gaps. This was a tad tricky. I used ffill(limit=1) to fill in only one slot of each gap. Then I took the slice of s2 where this forward-filled thing was not null. This gives me a single null for each gap.

s3 = s2[s2.ffill(limit=1).notnull()]

s3

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 03:00:00         NaN
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 11:00:00         NaN
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 17:00:00         NaN
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Name: val, dtype: float64
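
As a side note on the addendum's question about a configurable gap width (my assumption, not part of the original answer): the limit argument seems to control how many empty slots survive per gap, so a wider plotted gap could presumably be kept like this:

# assumption: keep two empty slots per gap instead of one, widening the plotted gaps
s3_wide = s2[s2.ffill(limit=2).notnull()]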

Lastly, if I plotted this, I'd still get irregular gaps. I need str indices so that matplotlib doesn't try to expand out my dates.

s3.reindex(s3.index.strftime('%H:%M'))

timestamp
00:00    0.740442
01:00    0.842911
02:00   -0.873992
03:00         NaN
07:00   -0.474993
08:00   -0.983963
09:00    0.597011
10:00   -2.043023
11:00         NaN
12:00    0.304668
13:00    1.185997
14:00    0.920850
15:00    0.201423
16:00    0.842970
17:00         NaN
21:00    1.061207
22:00    0.232180
23:00    0.453964
Name: val, dtype: float64

I'll plot them together so we can see the difference.

import matplotlib.pyplot as plt

f, a = plt.subplots(2, 1, sharey=True, figsize=(10, 5))
s2.plot(ax=a[0])                                       # hourly series with NaN gaps
s3.reindex(s3.index.strftime('%H:%M')).plot(ax=a[1])   # string-indexed series with uniform gaps

(resulting figure: s2 plotted in the upper axes, the string-indexed s3 in the lower axes)
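
If a more striking visual cue for the time jumps is wanted (picking up the addendum; this is an assumption of mine, not part of the original answer), the NaN slots in s3 mark exactly where a jump happens, so vertical markers could be drawn at those positions:

# sketch (assumption): mark every gap slot with a dashed vertical line
s4 = s3.reset_index(drop=True)           # positional x-axis, one slot per row
ax = s4.plot(figsize=(10, 3))
for pos in s4[s4.isnull()].index:        # the single NaN slot of each gap
    ax.axvline(pos, color='grey', linestyle='--', linewidth=1)
ax.set_xticks(range(len(s3)))
ax.set_xticklabels(s3.index.strftime('%H:%M'), rotation=90)
plt.show()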
