简体   繁体   English

使用 groupby Pandas ValueError 填充缺失值

[英]Fill missing values using groupby Pandas ValueError

I am trying to fill the missing values in the 'Date' column of my dataset.我正在尝试在我的数据集的“日期”列中填充缺失值。

CODE        City        Date         TAVG    TMAX   TMIN
CA003033890 Lethbridge  08-01-2020  -3.55    4.7    -11.8
CA003033890 Lethbridge  09-01-2020  -17.05  -11.5   -22.6
CA003033890 Lethbridge  10-01-2020  -13.7   -1.9    -25.5
CA003033890 Lethbridge  11-01-2020  -7.8     0.7    -16.3
CA003033890 Lethbridge  12-01-2020  -20.3   -16.3   -24.3
CA003033890 Lethbridge  13-01-2020  -24.6   -22.4   -26.8
CA003033890 Lethbridge  14-01-2020  -27     -23.7   -30.3
CA003033890 Lethbridge  15-01-2020  -29.55  -26.8   -32.3
CA003033890 Lethbridge  16-01-2020  -26.05  -23.2   -28.9
CA003033890 Lethbridge  17-01-2020  -23.45  -19.2   -27.7

For the above code CA003033890 , notice that the dates from 01-01-2020 to 07-01-2020 are missing, similarly for other CODE s, Date column values are randomly missing.对于上面的代码CA003033890 ,请注意缺少从 01-01-2020 到 07-01-2020 的日期,对于其他CODE类似, Date列值随机丢失。

This is the code that I tried这是我尝试过的代码

data.Date=pd.to_datetime(data.Date)
merge_df  = data.set_index('Date').groupby('CODE').apply(lambda x : x.resample('D').max().ffill()).reset_index(level=1)

When I ran it, it seemed to run forever and later returned the below error当我运行它时,它似乎永远运行,后来返回以下错误

Traceback (most recent call last):                                                                                                                                                                                                                      
  File "test.py", line 45, in <module>                                                                                                                                                                                                                  
    data['Date'] = data.groupby('CODE')['Date'].apply(lambda d: d.reindex(pd.date_range(min(df1.Date),max(df1.Date),freq='D'))).drop('CODE', axis=1).reset_index('CODE').fillna(value=None)                                                             
  File "C:\Python\Python38\lib\site-packages\pandas\core\series.py", line 4132, in drop                                                                                                                                                                 
    return super().drop(                                                                                                                                                                                                                                
  File "C:\Python\Python38\lib\site-packages\pandas\core\generic.py", line 3923, in drop                                                                                                                                                                
    axis_name = self._get_axis_name(axis)                                                                                                                                                                                                               
  File "C:\Python\Python38\lib\site-packages\pandas\core\generic.py", line 420, in _get_axis_name                                                                                                                                                       
    raise ValueError(f"No axis named {axis} for object type {cls}")                                                                                                                                                                                     
ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'> 

Expected Dataset(NaNs in the datset)预期数据集(数据集中的 NaN)

CODE        City        Date        TAVG    TMAX    TMIN
CA003033890 Lethbridge  01-01-2020          
CA003033890 Lethbridge  02-01-2020          
CA003033890 Lethbridge  03-01-2020          
CA003033890 Lethbridge  04-01-2020          
CA003033890 Lethbridge  05-01-2020          
CA003033890 Lethbridge  06-01-2020          
CA003033890 Lethbridge  07-01-2020          
CA003033890 Lethbridge  08-01-2020  -3.55    4.7    -11.8
CA003033890 Lethbridge  09-01-2020  -17.05  -11.5   -22.6
CA003033890 Lethbridge  10-01-2020  -13.7   -1.9    -25.5

Also, is there a faster way to achieve this?另外,有没有更快的方法来实现这一点?

You could create a MultiIndex and reindex then reset_index in each group:您可以在每个组中创建一个 MultiIndex 和reindex然后reset_index

df_list = []

for (code, group) in df.groupby('CODE'):
    idx = pd.MultiIndex.from_product([group['CODE'].unique(),
                                      pd.date_range(group['Date'].max().replace(day=1), end=group['Date'].max(), freq='D')],
                                     names=['CODE', 'Date'])
    group = group.set_index(['CODE', 'Date']).reindex(idx).reset_index()
    group['City'] = group['City'].fillna(method='bfill')
    df_list.append(group)

new_df = pd.concat(df_list, ignore_index=True)

A MWE:一个 MWE:

import sys
import pandas as pd
from io import StringIO

TESTDATA = StringIO("""CODE        City        Date         TAVG    TMAX   TMIN
CA003033890 Lethbridge  08-01-2020  -3.55    4.7    -11.8
CA003033890 Lethbridge  09-01-2020  -17.05  -11.5   -22.6
CA003033890 Lethbridge  10-01-2020  -13.7   -1.9    -25.5
CA003033890 Lethbridge  11-01-2020  -7.8     0.7    -16.3
CA003033890 Lethbridge  12-01-2020  -20.3   -16.3   -24.3
CA003033890 Lethbridge  13-01-2020  -24.6   -22.4   -26.8
CA003033890 Lethbridge  14-01-2020  -27     -23.7   -30.3
CA003033890 Lethbridge  15-01-2020  -29.55  -26.8   -32.3
CA003033890 Lethbridge  16-01-2020  -26.05  -23.2   -28.9
CA003033890 Lethbridge  17-01-2020  -23.45  -19.2   -27.7
CA003033891 abc         11-01-2020  -24.6   -22.4   -26.8
CA003033891 abc         14-01-2020  -27     -23.7   -30.3
CA003033891 abc         15-01-2020  -23.45  -19.2   -27.7
    """)

df = pd.read_csv(TESTDATA, delim_whitespace=True)

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

df_list = []

for (code, group) in df.groupby('CODE'):
    idx = pd.MultiIndex.from_product([group['CODE'].unique(),
                                      pd.date_range(group['Date'].max().replace(day=1), end=group['Date'].max(), freq='D')],
                                     names=['CODE', 'Date'])
    group = group.set_index(['CODE', 'Date']).reindex(idx).reset_index()
    group['City'] = group['City'].fillna(method='bfill')
    df_list.append(group)

new_df = pd.concat(df_list, ignore_index=True)
# print(new_df)

           CODE       Date        City   TAVG  TMAX  TMIN
0   CA003033890 2020-01-01  Lethbridge    NaN   NaN   NaN
1   CA003033890 2020-01-02  Lethbridge    NaN   NaN   NaN
2   CA003033890 2020-01-03  Lethbridge    NaN   NaN   NaN
3   CA003033890 2020-01-04  Lethbridge    NaN   NaN   NaN
4   CA003033890 2020-01-05  Lethbridge    NaN   NaN   NaN
5   CA003033890 2020-01-06  Lethbridge    NaN   NaN   NaN
6   CA003033890 2020-01-07  Lethbridge    NaN   NaN   NaN
7   CA003033890 2020-01-08  Lethbridge  -3.55   4.7 -11.8
8   CA003033890 2020-01-09  Lethbridge -17.05 -11.5 -22.6
9   CA003033890 2020-01-10  Lethbridge -13.70  -1.9 -25.5
10  CA003033890 2020-01-11  Lethbridge  -7.80   0.7 -16.3
11  CA003033890 2020-01-12  Lethbridge -20.30 -16.3 -24.3
12  CA003033890 2020-01-13  Lethbridge -24.60 -22.4 -26.8
13  CA003033890 2020-01-14  Lethbridge -27.00 -23.7 -30.3
14  CA003033890 2020-01-15  Lethbridge -29.55 -26.8 -32.3
15  CA003033890 2020-01-16  Lethbridge -26.05 -23.2 -28.9
16  CA003033890 2020-01-17  Lethbridge -23.45 -19.2 -27.7
17  CA003033891 2020-01-01         abc    NaN   NaN   NaN
18  CA003033891 2020-01-02         abc    NaN   NaN   NaN
19  CA003033891 2020-01-03         abc    NaN   NaN   NaN
20  CA003033891 2020-01-04         abc    NaN   NaN   NaN
21  CA003033891 2020-01-05         abc    NaN   NaN   NaN
22  CA003033891 2020-01-06         abc    NaN   NaN   NaN
23  CA003033891 2020-01-07         abc    NaN   NaN   NaN
24  CA003033891 2020-01-08         abc    NaN   NaN   NaN
25  CA003033891 2020-01-09         abc    NaN   NaN   NaN
26  CA003033891 2020-01-10         abc    NaN   NaN   NaN
27  CA003033891 2020-01-11         abc -24.60 -22.4 -26.8
28  CA003033891 2020-01-12         abc    NaN   NaN   NaN
29  CA003033891 2020-01-13         abc    NaN   NaN   NaN
30  CA003033891 2020-01-14         abc -27.00 -23.7 -30.3
31  CA003033891 2020-01-15         abc -23.45 -19.2 -27.7

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM