Pandas: fill in the row count for missing datetimes

I have a dataframe with a timestamp column. I can group its rows by timestamp into 10-minute bins, as you can see from the code below:

minutes = '10T'
grouped_df=df.loc[df['id_area'] == 3].groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))["x"].count()

When I print the result I get this:

timestamp
2022-11-09 14:10:00    2
2022-11-09 14:20:00    1
2022-11-09 15:10:00    1
2022-11-09 15:30:00    1
2022-11-09 16:10:00    2
Name: x, dtype: int64

As you can see, some 10-minute steps are missing; for example, there are no values between 14:20 and 15:10. I need to fill these steps with 0. How can I do it?

Data sample:

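import numpy as np
import pandas as pd

np.random.seed(2022)

N = 20
# 20 rows, 7 minutes apart, with a random area id and some NaN in x
df = pd.DataFrame({'id_area': np.random.choice([1, 2, 3], size=N),
                   'x': np.random.choice([1, np.nan], size=N),
                   'timestamp': pd.date_range('2022-11-11', freq='7Min', periods=N)})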

If you only need to add the missing datetimes to the DatetimeIndex, append Series.asfreq:

minutes = '10T'
grouped_df1=(df.loc[df['id_area'] == 3]
              .groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))["x"]
              .count()
              .asfreq(minutes, fill_value=0))

print (grouped_df1)
timestamp
2022-11-11 00:50:00    1
2022-11-11 01:00:00    0
2022-11-11 01:10:00    0
2022-11-11 01:20:00    0
2022-11-11 01:30:00    0
2022-11-11 01:40:00    0
2022-11-11 01:50:00    0
2022-11-11 02:00:00    1
Freq: 10T, Name: x, dtype: int64
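To see what asfreq does in isolation, here is a minimal sketch that rebuilds the counts printed in the question by hand (the name counts is just for illustration) and fills the gaps. I use '10min' instead of '10T' because, as far as I know, newer pandas versions prefer that spelling; both should mean the same frequency.

```python
import pandas as pd

# Counts as printed in the question: only bins that contain rows appear.
counts = pd.Series(
    [2, 1, 1, 1, 2],
    index=pd.DatetimeIndex(
        ["2022-11-09 14:10", "2022-11-09 14:20", "2022-11-09 15:10",
         "2022-11-09 15:30", "2022-11-09 16:10"],
        name="timestamp",
    ),
    name="x",
)

# asfreq reindexes onto a regular 10-minute grid between the first and
# last existing timestamps; bins that were absent get fill_value.
filled = counts.asfreq("10min", fill_value=0)
print(filled)
```

Note that asfreq only fills between the first and last timestamps already present, so bins before 14:10 or after 16:10 are not added.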

Or use Grouper:

minutes = '10T'
grouped_df1 = (df.assign(timestamp=pd.to_datetime(df["timestamp"]))
                 .loc[df['id_area'] == 3]
                 .groupby(pd.Grouper(freq=minutes, key='timestamp'))["x"]
                 .count())

print (grouped_df1)
timestamp
2022-11-11 00:50:00    1
2022-11-11 01:00:00    0
2022-11-11 01:10:00    0
2022-11-11 01:20:00    0
2022-11-11 01:30:00    0
2022-11-11 01:40:00    0
2022-11-11 01:50:00    0
2022-11-11 02:00:00    1
Freq: 10T, Name: x, dtype: int64
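Both variants should give the same series. A small check on a hand-made frame (the values below are invented for illustration, not taken from the question) might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three rows in area 3, one of them with x missing.
df = pd.DataFrame({
    "id_area": [3, 1, 3, 2, 3],
    "x": [1.0, 1.0, np.nan, 1.0, 1.0],
    "timestamp": pd.to_datetime([
        "2022-11-11 00:03", "2022-11-11 00:12", "2022-11-11 00:25",
        "2022-11-11 00:41", "2022-11-11 00:58",
    ]),
})
area3 = df.loc[df["id_area"] == 3]

# Variant 1: floor the timestamps, count, then fill the gaps with asfreq.
by_floor = (area3.groupby(area3["timestamp"].dt.floor("10min"))["x"]
                 .count()
                 .asfreq("10min", fill_value=0))

# Variant 2: let Grouper build the regular 10-minute bins itself.
by_grouper = area3.groupby(pd.Grouper(key="timestamp", freq="10min"))["x"].count()

print(by_floor.equals(by_grouper))
```

In both cases the bins run only from the first to the last row of area 3, and a bin whose only row has x missing is counted as 0, because count ignores NaN.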

If you need the bins that contain only non-matching rows to be counted as 0 as well, replace the non-matching x values with NaN using Series.where:

grouped_df2=(df['x'].where(df['id_area'] == 3)
                   .groupby(pd.to_datetime(df["timestamp"]).dt.floor(minutes))
                   .count())
print (grouped_df2)  
timestamp
2022-11-11 00:00:00    0
2022-11-11 00:10:00    0
2022-11-11 00:20:00    0
2022-11-11 00:30:00    0
2022-11-11 00:40:00    0
2022-11-11 00:50:00    1
2022-11-11 01:00:00    0
2022-11-11 01:10:00    0
2022-11-11 01:20:00    0
2022-11-11 01:30:00    0
2022-11-11 01:40:00    0
2022-11-11 01:50:00    0
2022-11-11 02:00:00    1
2022-11-11 02:10:00    0
Name: x, dtype: int64
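One caveat, as I understand it: the where variant keeps every bin that has at least one row in the whole dataframe, but a bin with no rows at all still does not appear. With the 7-minute sample data that never happens. With sparser data you can chain asfreq as well. A sketch on invented values:

```python
import pandas as pd

# Hypothetical data with a real gap: no rows at all between 00:20 and 00:40.
df = pd.DataFrame({
    "id_area": [3, 1, 2, 3],
    "x": [1.0, 1.0, 1.0, 1.0],
    "timestamp": pd.to_datetime([
        "2022-11-11 00:03", "2022-11-11 00:12",
        "2022-11-11 00:41", "2022-11-11 00:58",
    ]),
})

# Mask x for other areas, so their bins survive with a count of 0.
masked_counts = (df["x"].where(df["id_area"] == 3)
                        .groupby(df["timestamp"].dt.floor("10min"))
                        .count())

# Bins with no rows in df are still absent; asfreq adds them.
full_counts = masked_counts.asfreq("10min", fill_value=0)

print(masked_counts)
print(full_counts)
```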

For clarity, you can always create a parallel dataframe that contains every datetime you need (in this case, at 10-minute intervals):

grouped_df = grouped_df.reset_index()
times = pd.date_range(start=grouped_df['time'].min(), end=grouped_df['time'].max(), freq='10min')

Now all the datetimes you need should be in the times object:

    times:
DatetimeIndex(['2022-11-09 14:10:00', '2022-11-09 14:20:00',
               '2022-11-09 14:30:00', '2022-11-09 14:40:00',
               '2022-11-09 14:50:00', '2022-11-09 15:00:00',
               '2022-11-09 15:10:00', '2022-11-09 15:20:00',
               '2022-11-09 15:30:00', '2022-11-09 15:40:00',
               '2022-11-09 15:50:00', '2022-11-09 16:00:00',
               '2022-11-09 16:10:00'],
              dtype='datetime64[ns]', freq='10T')

We can then join the previous dataframe grouped_df with it and fill the blank values with zeroes:

final_df = pd.merge(grouped_df, pd.DataFrame(times, columns=['time']), how='outer', on='time').sort_values('time').fillna(0)

Your end result should look a lot like this (please keep in mind that I made up some values, with columns named time and values, to reproduce your original result):

                   time  values
0   2022-11-09 14:10:00    10.0
1   2022-11-09 14:20:00     5.0
2   2022-11-09 14:30:00     0.0
3   2022-11-09 14:40:00     0.0
4   2022-11-09 14:50:00     0.0
5   2022-11-09 15:00:00     0.0
6   2022-11-09 15:10:00    20.0
7   2022-11-09 15:20:00     0.0
8   2022-11-09 15:30:00    15.0
9   2022-11-09 15:40:00     0.0
10  2022-11-09 15:50:00     0.0
11  2022-11-09 16:00:00     0.0
12  2022-11-09 16:10:00    30.0
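Put together as one runnable piece, the merge approach could look roughly like this. The frame below uses the same made-up values and the assumed column names time and values; with the question's data the columns would be timestamp and x instead.

```python
import pandas as pd

# Made-up grouped result with columns 'time' and 'values'.
grouped_df = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-11-09 14:10", "2022-11-09 14:20", "2022-11-09 15:10",
        "2022-11-09 15:30", "2022-11-09 16:10",
    ]),
    "values": [10, 5, 20, 15, 30],
})

# Every 10-minute step between the first and last existing timestamp.
times = pd.date_range(start=grouped_df["time"].min(),
                      end=grouped_df["time"].max(), freq="10min")

# The outer merge adds the missing steps as NaN; fillna turns them into 0.
final_df = (grouped_df.merge(pd.DataFrame({"time": times}), how="outer", on="time")
                      .sort_values("time")
                      .fillna(0)
                      .reset_index(drop=True))

# The NaN step makes the column float; cast back if you want integer counts.
final_df["values"] = final_df["values"].astype(int)
print(final_df)
```

The cast at the end is optional. It is only there because the merge introduces NaN, which is why the values in the table above show up as floats.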
