
Pandas - Replace Duplicates with NaN and Keep Row

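How can I replace the duplicates within each group with NaN while keeping the rows?

I need to keep the rows rather than drop them, and ideally keep the first original value where it first appears.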

import pandas as pd
from datetime import timedelta

df = pd.DataFrame({
    'date': ['2019-01-01 00:00:00','2019-01-01 01:00:00','2019-01-01 02:00:00', '2019-01-01 03:00:00',
             '2019-09-01 02:00:00','2019-09-01 03:00:00','2019-09-01 04:00:00', '2019-09-01 05:00:00'],
    'value': [10,10,10,10,12,12,12,12],
    'ID': ['Jackie','Jackie','Jackie','Jackie','Zoop','Zoop','Zoop','Zoop',]
})

df['date'] = pd.to_datetime(df['date'])  # infer_datetime_format is deprecated since pandas 2.0


                 date  value      ID
0 2019-01-01 00:00:00     10  Jackie
1 2019-01-01 01:00:00     10  Jackie
2 2019-01-01 02:00:00     10  Jackie
3 2019-01-01 03:00:00     10  Jackie
4 2019-09-01 02:00:00     12    Zoop
5 2019-09-01 03:00:00     12    Zoop
6 2019-09-01 04:00:00     12    Zoop
7 2019-09-01 05:00:00     12    Zoop

Desired DataFrame:

                 date  value      ID
0 2019-01-01 00:00:00     10  Jackie
1 2019-01-01 01:00:00    NaN  Jackie
2 2019-01-01 02:00:00    NaN  Jackie
3 2019-01-01 03:00:00    NaN  Jackie
4 2019-09-01 02:00:00     12    Zoop
5 2019-09-01 03:00:00    NaN    Zoop
6 2019-09-01 04:00:00    NaN    Zoop
7 2019-09-01 05:00:00    NaN    Zoop

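Edit:

Duplicate values should only be removed within the same date, regardless of frequency. So if the value 10 appears twice on Jan-1 and three times on Jan-2, it should appear only once on Jan-1 and once on Jan-2.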

I assume you want to check for duplicates on the columns value and ID, and additionally on the date part of the column date.

import numpy as np

df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = np.nan

Out[269]:
                 date  value      ID
0 2019-01-01 00:00:00   10.0  Jackie
1 2019-01-01 01:00:00    NaN  Jackie
2 2019-01-01 02:00:00    NaN  Jackie
3 2019-01-01 03:00:00    NaN  Jackie
4 2019-09-01 02:00:00   12.0    Zoop
5 2019-09-01 03:00:00    NaN    Zoop
6 2019-09-01 04:00:00    NaN    Zoop
7 2019-09-01 05:00:00    NaN    Zoop
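
To make the one-liner above easier to follow, here is a step-by-step sketch on a small hypothetical frame (`demo`, not taken from the question) in which the value 10 appears on two different days, as described in the question's edit. The names `day` and `is_repeat` are illustrative only.

```python
import pandas as pd

# Hypothetical data: value 10 appears twice on Jan 1 and three times on Jan 2.
demo = pd.DataFrame({
    'date': pd.to_datetime([
        '2019-01-01 00:00:00', '2019-01-01 01:00:00',
        '2019-01-02 00:00:00', '2019-01-02 01:00:00', '2019-01-02 02:00:00',
    ]),
    'value': [10, 10, 10, 10, 10],
    'ID': ['Jackie'] * 5,
})

# Step 1: derive the calendar day from each timestamp.
day = demo['date'].dt.date

# Step 2: flag rows whose (value, ID, day) combination was already seen.
is_repeat = demo.assign(d=day).duplicated(['value', 'ID', 'd'])

# Step 3: blank the flagged values; every row is kept.
demo['value'] = demo['value'].mask(is_repeat)

print(demo)
```

The first row of each day keeps its value and the later rows of that day become NaN, so the column is upcast to float.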

As @Trenton suggested, you can use pd.NA to avoid importing numpy.

Note, as suggested by @rafaelc: this link explains the differences between pd.NA and np.nan in detail: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values

df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = pd.NA

Out[273]:
                 date value      ID
0 2019-01-01 00:00:00    10  Jackie
1 2019-01-01 01:00:00  <NA>  Jackie
2 2019-01-01 02:00:00  <NA>  Jackie
3 2019-01-01 03:00:00  <NA>  Jackie
4 2019-09-01 02:00:00    12    Zoop
5 2019-09-01 03:00:00  <NA>    Zoop
6 2019-09-01 04:00:00  <NA>    Zoop
7 2019-09-01 05:00:00  <NA>    Zoop
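
In the output above the value column no longer looks like a plain integer column, which suggests it was converted to a generic object dtype when pd.NA was assigned. One option, assuming pandas 1.0 or later, is to cast the column to the nullable `Int64` dtype first so that it stays an integer column. This is a sketch on a shortened copy of the question's data, not part of the original answer.

```python
import pandas as pd

# Shortened copy of the question's data.
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
                            '2019-09-01 02:00:00', '2019-09-01 03:00:00']),
    'value': [10, 10, 12, 12],
    'ID': ['Jackie', 'Jackie', 'Zoop', 'Zoop'],
})

# Nullable integer dtype: can hold pd.NA without leaving the integer type.
df['value'] = df['value'].astype('Int64')

# Same duplicate check as in the answer, stored in a named mask.
is_repeat = df.assign(d=df['date'].dt.date).duplicated(['value', 'ID', 'd'])
df.loc[is_repeat, 'value'] = pd.NA

print(df.dtypes)
```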

This approach works if the dataframe is sorted, as in your example:

import numpy as np                                    # to be used for np.nan

df['duplicate'] = df['value'].shift(1)                # create a duplicate column 
df['value'] = df.apply(lambda x: np.nan if x['value'] == x['duplicate'] \
                          else x['value'], axis=1)    # conditional replace
df = df.drop('duplicate', axis=1)                     # drop helper column
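
The row-wise `apply` above compares each value only with the previous row, so it takes neither `ID` nor the date into account. A vectorized sketch of the same shift-and-compare idea, applied within each (ID, calendar day) group, might look like the following. The names `day` and `prev` are illustrative, and like the original it assumes that equal values sit on consecutive rows.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01 00:00:00', '2019-01-01 01:00:00',
             '2019-01-01 02:00:00', '2019-01-01 03:00:00',
             '2019-09-01 02:00:00', '2019-09-01 03:00:00',
             '2019-09-01 04:00:00', '2019-09-01 05:00:00'],
    'value': [10, 10, 10, 10, 12, 12, 12, 12],
    'ID': ['Jackie'] * 4 + ['Zoop'] * 4,
})
df['date'] = pd.to_datetime(df['date'])

# Previous value within the same ID and calendar day (NaN for the first row).
day = df['date'].dt.date
prev = df.groupby(['ID', day])['value'].shift(1)

# Keep a value only where it differs from the previous one in its group.
df['value'] = df['value'].where(df['value'] != prev)

print(df)
```

This avoids the helper column and the Python-level loop that `apply(axis=1)` implies.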

Group by date, take the first observation (not necessarily the first one when sorted by time), then merge the result back into the original dataframe.

df2 = df.groupby([df['date'].dt.date, 'ID'], as_index=False).first()
>>> df.drop(columns='value').merge(df2, on=['date', 'ID'], how='left')[df.columns]
                 date  value      ID
0 2019-01-01 00:00:00   10.0  Jackie
1 2019-01-01 01:00:00    NaN  Jackie
2 2019-01-01 02:00:00    NaN  Jackie
3 2019-01-01 03:00:00    NaN  Jackie
4 2019-09-01 02:00:00   12.0    Zoop
5 2019-09-01 03:00:00    NaN    Zoop
6 2019-09-01 04:00:00    NaN    Zoop
7 2019-09-01 05:00:00    NaN    Zoop
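
The same "first observation per ID and day" idea can also be sketched without a merge, by numbering the rows inside each group and blanking everything after position 0. This is an alternative, not the answer's method; `day` and `position` are illustrative names, and the rows are assumed to already be in the order in which the first value should be kept.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01 00:00:00', '2019-01-01 01:00:00',
             '2019-01-01 02:00:00', '2019-01-01 03:00:00',
             '2019-09-01 02:00:00', '2019-09-01 03:00:00',
             '2019-09-01 04:00:00', '2019-09-01 05:00:00'],
    'value': [10, 10, 10, 10, 12, 12, 12, 12],
    'ID': ['Jackie'] * 4 + ['Zoop'] * 4,
})
df['date'] = pd.to_datetime(df['date'])

# 0, 1, 2, ... position of each row within its (ID, calendar day) group.
day = df['date'].dt.date
position = df.groupby(['ID', day]).cumcount()

# Keep the value only on the first row of each group.
df['value'] = df['value'].where(position == 0)

print(df)
```

Unlike the `duplicated` approach, this blanks every row after the first one in a group even if its value differs, so it only matches the desired output when each group holds a single repeated value.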
