
Check if each user has consecutive dates in a Python 3 pandas dataframe

Imagine there is a dataframe:

   id        date  balance_total  transaction_total
0   1  01/01/2019          102.0               -1.0
1   1  01/02/2019          100.0               -2.0
2   1  01/03/2019          100.0                NaN
3   1  01/04/2019          100.0                NaN
4   1  01/05/2019           96.0               -4.0
5   2  01/01/2019          200.0               -2.0
6   2  01/02/2019          100.0               -2.0
7   2  01/04/2019          100.0                NaN
8   2  01/05/2019           96.0               -4.0

Here is the command to create the dataframe:

import pandas as pd
import numpy as np

users=pd.DataFrame(
                [
                {'id':1,'date':'01/01/2019', 'transaction_total':-1, 'balance_total':102},
                {'id':1,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
                {'id':1,'date':'01/03/2019', 'transaction_total':np.nan, 'balance_total':100},
                {'id':1,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
                {'id':1,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':96},
                {'id':2,'date':'01/01/2019', 'transaction_total':-2, 'balance_total':200},
                {'id':2,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
                {'id':2,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
                {'id':2,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':96}  
                ]
                )

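How can I check whether each id has consecutive dates? I tried the "shift" idea from the following question, but it doesn't seem to work:

Calculating time difference between two rows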

df['index_col'] = df.index

for id in df['id'].unique():

    # create an empty QA dataframe

    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns = column_names)

    df_qa['Delta']=(df['index_col'] - df['index_col'].shift(1))

    if (df_qa['Delta'].iloc[1:] != 1).any() is True:

        print('id ' + id +' might have non-consecutive dates')

        # doesn't print any account => Each Customer's Daily Balance has Consecutive Dates
    break

Desired output:

it should print id 2 might have non-consecutive dates

Thanks!

Use groupby and diff:

df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")

df["difference"] = df.groupby("id")["date"].diff()

print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])

#
   id       date  transaction_total  balance_total difference
7   2 2019-01-04                NaN          100.0     2 days
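
The groupby/diff idea above can be wrapped in a small helper. This is only a sketch: the function name ids_with_date_gaps and its parameters are made up for illustration, and it assumes the rows of each id are already in date order.

```python
import pandas as pd

def ids_with_date_gaps(frame, id_col="id", date_col="date"):
    """Return ids that have at least one gap of more than one day.

    Hypothetical helper; assumes rows are sorted by date within each id.
    """
    # Difference to the previous row of the same id (NaT for each id's first row)
    gaps = frame.groupby(id_col)[date_col].diff()
    # NaT compares as False, so the first row of each id is never flagged
    has_gap = gaps > pd.Timedelta(days=1)
    return frame.loc[has_gap, id_col].unique().tolist()

users = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "date": ["01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/05/2019",
             "01/01/2019", "01/02/2019", "01/04/2019", "01/05/2019"],
})
users["date"] = pd.to_datetime(users["date"], format="%m/%d/%Y")

for user_id in ids_with_date_gaps(users):
    print(f"id {user_id} might have non-consecutive dates")
```

With the sample data from the question this should report only id 2, since its dates jump from 01/02/2019 to 01/04/2019.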

Use DataFrameGroupBy.diff with Series.dt.days, test for values greater than 1 with Series.gt, and select only the id column with DataFrame.loc:

users['date'] = pd.to_datetime(users['date'])

i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]

for val in i:
    print( f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates

The first step is to parse date:

users['date'] = pd.to_datetime(users.date)

Then add a shifted column for each of the id and date columns:

users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)

The difference between the date and date_shifted columns is interesting:

>>> users.date - users.date_shifted

0       NaT
1    1 days
2    1 days
3    1 days
4    1 days
5   -4 days
6    1 days
7    2 days
8    1 days
dtype: timedelta64[ns]

You can now query the DataFrame to get what you want:

users[(users.id_shifted == users.id) & (users.date - users.date_shifted != pd.Timedelta(days=1))]

That is, consecutive rows for the same user whose dates do not differ by exactly 1 day.

This solution does assume the data is sorted by (id, date).
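
If the rows are not guaranteed to be in that order, one option is to sort them before shifting. The following is a minimal sketch with a small shuffled sample made up for illustration (it is not the data from the question); the variable names same_user, not_one_day and gaps are also only illustrative.

```python
import pandas as pd

# Deliberately unsorted sample rows
users = pd.DataFrame({
    "id": [2, 1, 2, 1, 2, 1],
    "date": ["01/04/2019", "01/02/2019", "01/01/2019",
             "01/01/2019", "01/02/2019", "01/03/2019"],
})
users["date"] = pd.to_datetime(users["date"], format="%m/%d/%Y")

# Sort by (id, date) so each row's predecessor is the previous date of the same id
users = users.sort_values(["id", "date"]).reset_index(drop=True)

users["id_shifted"] = users["id"].shift(1)
users["date_shifted"] = users["date"].shift(1)

# Keep rows that follow a row of the same id but are not exactly one day later
same_user = users["id_shifted"] == users["id"]
not_one_day = (users["date"] - users["date_shifted"]) != pd.Timedelta(days=1)
gaps = users[same_user & not_one_day]
print(gaps[["id", "date"]])
```

Resetting the index after sorting is optional here; it only keeps the row labels in step with the new order.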

