
How to summarize missing values in time series data in a Pandas Dataframe?

I have a time series dataset like the following:

[screenshot of the dataset]

As seen, there are three columns of channel values paired against the same set of timestamps. Each channel contains runs of NaN values.

My objective is to create a summary of these NaN values as follows: [screenshot of the desired summary table]

My approach (inefficient): use a for loop over each channel column, with a nested for loop over the rows of that channel. Whenever it encounters a run of NaN values, it records the start timestamp, end timestamp, and duration as individual rows (or lists), which I can finally stack together as the output.
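For concreteness, the nested-loop idea described above can be sketched like this (the frame, channel name, and values are made up for illustration); it produces the desired summary, but does Python-level work for every cell:

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame standing in for the real 200-column dataset
df = pd.DataFrame({
    'date_time': pd.date_range('2019-09-19 10:00', periods=6, freq='15min'),
    'Channel_1': [1.0, np.nan, np.nan, 4.0, np.nan, 6.0],
})

rows = []
for col in df.columns.drop('date_time'):        # outer loop over channels
    start = None
    for i, v in enumerate(df[col]):             # inner loop over rows
        if pd.isna(v) and start is None:
            start = df['date_time'].iloc[i]     # a NaN run begins here
        elif not pd.isna(v) and start is not None:
            end = df['date_time'].iloc[i]       # first valid reading after the run
            rows.append((col, start, end, end - start))
            start = None

summary = pd.DataFrame(rows, columns=['Channel No.', 'Starting_Timestamp',
                                      'Ending_Timestamp', 'Duration'])
```

With two NaN runs in the toy channel, `summary` has two rows, one per gap.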

But this logic seems quite inefficient and slow, especially considering that my original dataset has 200 channel columns and 10k rows. I'm sure there must be a better approach in Python.

Can anyone suggest an appropriate way to deal with this using Pandas in Python?

Use DataFrame.melt to reshape the DataFrame, then filter the consecutive groups of missing values (together with the first valid value after each run) and build the new DataFrame by aggregating with min and max:

df['date_time'] = pd.to_datetime(df['date_time'])

# reshape to long format: one row per (timestamp, channel) pair
df1 = df.melt('date_time', var_name='Channel No.')

# m is True where the previous value (in melted order) was not NaN;
# fill_value=False makes the very first row count as "previous was valid"
m = df1['value'].shift(fill_value=False).notna()
# keep the NaN rows plus the first valid row after each NaN run,
# so the ending timestamp is the first reading after the gap
mask = df1['value'].isna() | ~m

# m.cumsum() is constant within each NaN run, so it serves as a group id
df1 = (df1.groupby([m.cumsum()[mask], 'Channel No.'])
          .agg(Starting_Timestamp = ('date_time','min'),
               Ending_Timestamp = ('date_time','max'))
          .assign(Duration = lambda x: x['Ending_Timestamp'].sub(x['Starting_Timestamp']))
          .droplevel(0)
          .reset_index()
        )

print(df1)

print (df1)
  Channel No.  Starting_Timestamp    Ending_Timestamp        Duration
0   Channel_1 2019-09-19 10:59:00 2019-09-19 14:44:00 0 days 03:45:00
1   Channel_1 2019-09-19 22:14:00 2019-09-19 23:29:00 0 days 01:15:00
2   Channel_2 2019-09-19 13:59:00 2019-09-19 19:44:00 0 days 05:45:00
3   Channel_3 2019-09-19 10:59:00 2019-09-19 12:44:00 0 days 01:45:00
4   Channel_3 2019-09-19 15:14:00 2019-09-19 16:44:00 0 days 01:30:00
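Since the question's screenshot isn't reproducible, here is a self-contained run of the same approach on invented data (two channels sampled every 15 minutes; the values and timestamps are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical sample data with one NaN run per channel
df = pd.DataFrame({
    'date_time': pd.date_range('2019-09-19 10:00', periods=8, freq='15min'),
    'Channel_1': [1.0, np.nan, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    'Channel_2': [1.0, 2.0, 3.0, np.nan, np.nan, np.nan, 7.0, 8.0],
})

df1 = df.melt('date_time', var_name='Channel No.')
m = df1['value'].shift(fill_value=False).notna()
mask = df1['value'].isna() | ~m

out = (df1.groupby([m.cumsum()[mask], 'Channel No.'])
          .agg(Starting_Timestamp=('date_time', 'min'),
               Ending_Timestamp=('date_time', 'max'))
          .assign(Duration=lambda x: x['Ending_Timestamp'].sub(x['Starting_Timestamp']))
          .droplevel(0)
          .reset_index())
```

Note that `Ending_Timestamp` is the first valid reading after each gap, so a two-sample gap at a 15-minute rate reports a 30-minute duration.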

Use:

# assumes a single-channel frame with columns 'date' and 'g',
# like the toy frame built in the snippet further below

# index positions of all NaN rows
inds = df[df['g'].isna()].index.to_list()

# split those positions into runs of consecutive indices
gs = []
s = 0
for i, x in enumerate(inds):
    if i < len(inds) - 1:
        if x + 1 != inds[i + 1]:     # next NaN is not adjacent: a run ends here
            gs.append(inds[s:i + 1])
            s = i + 1
    else:
        gs.append(inds[s:i + 1])

# start of each run and the first valid timestamp after it
# (note: g[-1] + 1 assumes the series does not end with a NaN run)
ses = []
for g in gs:
    ses.append([df.iloc[g[0]]['date'], df.iloc[g[-1] + 1]['date']])

res = pd.DataFrame(ses, columns=['st', 'et'])
res['d'] = res['et'] - res['st']

And a more efficient solution:

import pandas as pd
import numpy as np

# toy single-channel frame: 12 evenly spaced timestamps with two NaN runs
df = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-12-01', periods=12),
                   'g': np.arange(12, dtype=float)})
df.loc[0:3, 'g'] = np.nan
df.loc[5:7, 'g'] = np.nan

# positions one past the end of each NaN run (isna flips from 1 to 0 there)
inds = df[df['g'].isna().astype(int).diff() == -1].index + 1
# split at those positions and keep only the blocks that begin with NaN
pd.DataFrame([(x.iloc[0]['date'], x.iloc[-1]['date'])
              for x in np.array_split(df, inds)
              if np.isnan(x['g'].iloc[0])])
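For re-use across many channel columns, the same split-point idea can be wrapped in a small helper. This is a sketch that detects run boundaries with NumPy flip masks instead of `np.array_split`; `gap_summary` and its output column names are hypothetical choices, not taken from the answers above:

```python
import pandas as pd
import numpy as np

def gap_summary(df, value_col, time_col):
    """Summarize NaN runs in one column: start, first valid reading after, duration."""
    isna = df[value_col].isna().to_numpy()
    prev = np.r_[False, isna[:-1]]               # isna shifted down by one row
    starts = np.flatnonzero(isna & ~prev)        # isna flips 0 -> 1: run starts
    ends = np.flatnonzero(~isna & prev)          # isna flips 1 -> 0: first valid row after
    times = df[time_col].to_numpy()
    # zip truncates, so a run still open at the end of the series is dropped
    rows = [(times[s], times[e]) for s, e in zip(starts, ends)]
    out = pd.DataFrame(rows, columns=['Starting_Timestamp', 'Ending_Timestamp'])
    out['Duration'] = out['Ending_Timestamp'] - out['Starting_Timestamp']
    return out

# demo on a daily toy frame with NaN runs at rows 0-3 and 5-7
df = pd.DataFrame({'date': pd.date_range('2021-01-01', periods=12, freq='D'),
                   'g': np.arange(12, dtype=float)})
df.loc[0:3, 'g'] = np.nan
df.loc[5:7, 'g'] = np.nan
print(gap_summary(df, 'g', 'date'))
```

Applied per channel column, this stays vectorized over the rows, which is where the 10k-row cost lies.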
