Imputing Missing Values with Multiple Groups

I am merging monthly data with quarterly financial data for different companies in Python. Each stock has monthly data for some columns and only quarterly data for others. Below is a sample dataframe.

import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
raw_data = {'gvkey': [1004, 1004, 1004, 1004, 1004, 1004, 1045, 1045, 1045, 1045, 1045, 1045],
        'date': ['2018-08-31', '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31', '2019-01-31', '2018-08-31', '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31', '2019-01-31'],
        'trt1m': [-1.5609, 2.6141, -0.4907, -8.1757, -14.5342, 1.1114, -0.2488, -14.939, 5.6241, 8.5137, 2.3091, -7.335],
        'epsfxq': [np.nan, 0.52, np.nan, np.nan, 0.54, np.nan, np.nan, -0.28, np.nan, np.nan, -3.29, np.nan],
        'roa': [0.079, 0.079, 0.079, 0.082, 0.082, 0.082, 0.104, 0.104, 0.104, 0.090, 0.090, 0.090]}

df = pd.DataFrame(raw_data, columns=['gvkey', 'date', 'trt1m', 'epsfxq', 'roa'])
df.head(12)

I am trying to impute the NaN values in my data frame. When I group by the date or the gvkey (read: stock ID), I can forward fill (ffill) or backward fill (bfill) the missing values successfully, but I lose the date and gvkey columns when I do this.

Does anyone have any advice on how to impute these missing values for multiple groups (grouped by date and gvkey, in this example)? I would greatly appreciate any advice you can give.

Thank you.

df.fillna(method='ffill') (or df.ffill() in current pandas, where the method argument of fillna is deprecated) should do the trick, no need to group.
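
One thing to check before relying on an ungrouped fill: since the rows for all companies sit in one frame, a plain forward fill can carry the last value of one gvkey into the leading NaNs of the next. A minimal sketch of that effect, using a cut-down, made-up version of the sample frame (the names demo, plain and grouped are illustrative only, not from the original post):

```python
import numpy as np
import pandas as pd

# Cut-down illustrative frame: two companies, three months each.
demo = pd.DataFrame({
    "gvkey": [1004, 1004, 1004, 1045, 1045, 1045],
    "date": ["2018-08-31", "2018-09-30", "2018-10-31",
             "2018-08-31", "2018-09-30", "2018-10-31"],
    "epsfxq": [np.nan, 0.52, np.nan, np.nan, -0.28, np.nan],
})

# Ungrouped forward fill: row 3 (the first row of gvkey 1045)
# inherits the last known value of gvkey 1004.
plain = demo.ffill()

# Grouped forward fill: the fill restarts for each gvkey,
# so the leading NaN of gvkey 1045 stays NaN.
grouped = demo.groupby("gvkey")["epsfxq"].ffill()

print(plain.loc[3, "epsfxq"])
print(grouped.loc[3])
```

If the first observation of every company is already populated, the two approaches give the same result; otherwise the grouped form is the safer choice.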

ADDITION: To answer the OP's concern:

ll = []
# Forward fill each company's rows separately, then stitch them back together
for i, j in df.groupby('gvkey'):
    ll.append(j.ffill())
newdf = pd.concat(ll)

This works:

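# Columns that only have quarterly values and need filling
fill_cols = ['epsfxq']
# Forward fill within each gvkey, then backward fill any leading NaNs
df[fill_cols] = df.groupby(['gvkey'])[fill_cols].ffill()
df[fill_cols] = df.groupby(['gvkey'])[fill_cols].bfill()
df.head(12)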
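
For reference, a self-contained sketch of why the column-subset form keeps the identifier columns. It assumes a small made-up frame (demo and filled_all are illustrative names, not from the original post). Calling ffill on the whole grouped frame returns only the non-grouping columns, whereas selecting the columns to fill and assigning the result back leaves gvkey and date in place.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: two companies, three months each.
demo = pd.DataFrame({
    "gvkey": [1004, 1004, 1004, 1045, 1045, 1045],
    "date": ["2018-08-31", "2018-09-30", "2018-10-31",
             "2018-08-31", "2018-09-30", "2018-10-31"],
    "epsfxq": [np.nan, 0.52, np.nan, np.nan, -0.28, np.nan],
})

# Filling the whole grouped frame: the grouping column is not in the result.
filled_all = demo.groupby("gvkey").ffill()
print(list(filled_all.columns))

# Selecting the columns to fill and assigning back keeps every column.
fill_cols = ["epsfxq"]
demo[fill_cols] = demo.groupby("gvkey")[fill_cols].ffill()
demo[fill_cols] = demo.groupby("gvkey")[fill_cols].bfill()
print(demo)
```

Note that the backward fill copies a later quarter's value into earlier months of the same company, which may or may not be acceptable depending on how the data is used afterwards.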

Thanks for your help.
