Fill in gaps in panel data using Python Pandas

Consider an unbalanced panel where the gaps are informative (e.g. true zeros). I would like to add the zeros back in. Essentially, I am trying to recreate the functionality of the Stata command tsfill in pandas.
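
To make "balanced" concrete: a panel is balanced when every id has exactly one row for each date in the range. The following is a minimal sketch of such a check. The helper name `is_balanced` and the toy frame are illustrative and not from the original post, and the check assumes daily dates with no time-of-day component.

```python
import pandas as pd

def is_balanced(df, id_col="id", date_col="date"):
    """True if every id has exactly one row per day in the observed date range."""
    full_range = pd.date_range(df[date_col].min(), df[date_col].max())
    expected_rows = df[id_col].nunique() * len(full_range)
    no_duplicates = not df.duplicated([id_col, date_col]).any()
    # With no duplicate (id, date) pairs, the row count matches the full grid
    # only if no pair is missing.
    return no_duplicates and len(df) == expected_rows

# Toy panel (hypothetical data): id 1 has no row for 2015-01-02.
toy = pd.DataFrame({
    "id": [0, 0, 0, 1, 1],
    "date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03",
                            "2015-01-01", "2015-01-03"]),
    "value": [3, 1, 2, 5, 4],
})
toy_is_balanced = is_balanced(toy)
```

Filling the gap for id 1 with a zero row is the operation the question is asking about.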

Example data (I construct a balanced panel and then remove some of the observations):

import numpy as np
import pandas as pd

np.random.seed(123456)

# one row per day of 2015
all_dates = pd.DataFrame({'date': pd.date_range('2015-01-01', '2015-12-31')})

# balanced panel: ids 0-99, each with the full set of dates
# (DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead)
balanced_data = pd.concat([all_dates.assign(id=x) for x in range(100)],
                          ignore_index=True)
balanced_data['random'] = np.random.random_sample(len(balanced_data)) >= 0.5

# remove some data: keep only the rows where 'random' is True
unbalanced_data = balanced_data[balanced_data['random']].reset_index(drop=True)

One way to make the panel balanced again is to merge the unbalanced panel onto a dataframe with balanced id and date columns:

# construct one full set of dates for everyone
all_dates = pd.DataFrame(pd.date_range(unbalanced_data['date'].min(), unbalanced_data['date'].max()), columns=['date'])

# stack one copy of the dates per id
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
n_ids = unbalanced_data['id'].nunique()
all_dates_full = all_dates
for x in range(n_ids - 1):
    all_dates_full = pd.concat([all_dates_full, all_dates])

all_dates_full.reset_index(inplace=True, drop=True)

# duplicate ids to match the number of dates
n_dates = len(all_dates)
ids = unbalanced_data['id'].drop_duplicates()
ids_full = ids
for x in range(n_dates - 1):
    ids_full = pd.concat([ids_full, ids])

ids_full = ids_full.sort_values().reset_index(drop=True)

balanced_panel = pd.concat([all_dates_full, ids_full], axis=1)

rebalanced_data = pd.merge(balanced_panel, unbalanced_data, how='left', on=['id', 'date'])
# rows missing from the unbalanced panel come back as NaN; restore them as False
rebalanced_data['random'] = rebalanced_data['random'].fillna(False).astype(bool)

# check: True if the rebuilt panel matches the original balanced one
(balanced_data == rebalanced_data).all().all()

In addition to being clunky, this approach gets really slow as N gets big. I figured there must be a more efficient way to rebalance the panel, but I couldn't find it.

(P.S. This is my first question on Stack Overflow, so any constructive criticism for future questions is very much appreciated!)

As far as performance goes, appending dataframes in pandas is a slow operation compared to appending to lists. Indexes are immutable, so each append builds a new index and returns a new object rather than growing the existing one in place. Here is a solution that builds the collections outside of pandas and then joins them into a dataframe.

uid = np.sort(unbalanced_data['id'].unique())
# repeat each actual id once per date (the id values themselves, not their positions)
ids_full = np.repeat(uid, len(all_dates))
# tile the full list of dates once per id
dates = all_dates['date'].tolist() * len(uid)
balanced_panel = pd.DataFrame({'id': ids_full, 'date': dates})
rebalanced_data = pd.merge(balanced_panel, unbalanced_data, how='left',
                           on=['id', 'date']).fillna(False)
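
As an alternative that is not part of the answer above, the same grid of (id, date) pairs can be built with `pd.MultiIndex.from_product` and applied with `reindex`, which avoids the explicit merge. The sketch below uses a small toy frame rather than the data from the question, and it assumes each (id, date) pair appears at most once, since `reindex` requires a unique index.

```python
import numpy as np
import pandas as pd

# Toy unbalanced panel (illustrative data, not the data from the question).
unbalanced = pd.DataFrame({
    "id": [0, 0, 1, 2],
    "date": pd.to_datetime(["2015-01-01", "2015-01-03",
                            "2015-01-02", "2015-01-03"]),
    "random": [True, True, True, True],
})

# Every (id, date) pair the balanced panel should contain.
full_index = pd.MultiIndex.from_product(
    [
        np.sort(unbalanced["id"].unique()),
        pd.date_range(unbalanced["date"].min(), unbalanced["date"].max()),
    ],
    names=["id", "date"],
)

# Reindex onto the full grid; pairs absent from the input receive fill_value.
rebalanced = (
    unbalanced.set_index(["id", "date"])
    .reindex(full_index, fill_value=False)
    .reset_index()
)
```

Here `rebalanced` has one row per id per day, ordered by id and then by date, with the previously missing rows filled in as False. For a numeric column, `fill_value=0` would restore true zeros in the same way.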
