
Imputing Values Based on FirstYear and LastYear in Long Table Format

I have a firm-level long table that records each firm's first and last active year and its zip code.

import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 2018],
                   'LastYear': [2021, 2022, 2019],
                   'Zipcode': ['00000', '00001', '00003']})


Firm    FirstYear   LastYear    Zipcode
A   2020    2021    00000
B   2019    2022    00001
C   2018    2019    00003

I want panel data that has the zip code for every active year. Ideally, I want a wide table that imputes Zipcode for the first year, the last year, and every year in between.

It should look like this:

    2020    2021    2019    2022    2018
A   00000   00000           
B   00001   00001   00001   00001   
C                   00003          00003

I have some code that creates a long table row by row, but I have many millions of rows and it takes a long time. In terms of performance and memory use, what is the best way in pandas to transform this table so that the zip code is imputed for every active year?

Thanks in advance.

Responding to the answer's update: imagine there is a firm whose first and last years don't overlap with those of the other firms.

df=pd.DataFrame({'Firm':['A','B','C'],
         'FirstYear':[2020, 2019, 1997],
         'LastYear':[2021, 2022, 2008],
         'Zipcode':['00000','00001','00003']})

The output from the code looks like this, with no columns for the years between 1997 and 2008:

Firm    2020    2021    2019    2022    1997    2008
A       00000   00000               
B       00001   00001   00001   00001       
C                                      00003    00003

Here is a solution with pd.melt():

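# Reshape: one row per firm, one column per year that occurs as a FirstYear or LastYear
d = (pd.melt(df, id_vars=['Firm', 'Zipcode'])
       .set_index(['Firm', 'value'])['Zipcode']
       .unstack(level=1))

# Fill the years between each firm's first and last year, then restore the column order
filled = d.ffill(axis=1)
d = (filled
       .where(filled.notna() & d.bfill(axis=1).notna())
       .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))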
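The reshape above only creates columns for years that occur as some firm's FirstYear or LastYear, so a firm active from 1997 to 2008 gets no columns for 1998 through 2007. One possible workaround, offered as a sketch rather than as part of the original answer, is to reindex the columns to every calendar year in the overall span before filling. The names `all_years` and `filled` are introduced here for illustration, and the sketch assumes each firm appears once with FirstYear different from LastYear (otherwise `unstack` would see duplicate index entries).

```python
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 1997],
                   'LastYear': [2021, 2022, 2008],
                   'Zipcode': ['00000', '00001', '00003']})

# Same reshape as in the answer: one row per firm, one column per boundary year.
d = (pd.melt(df, id_vars=['Firm', 'Zipcode'])
       .set_index(['Firm', 'value'])['Zipcode']
       .unstack(level=1))

# Assumption: we want a column for every calendar year from the earliest
# FirstYear to the latest LastYear, not only the boundary years.
all_years = range(df['FirstYear'].min(), df['LastYear'].max() + 1)
d = d.reindex(columns=all_years)

# A cell is kept only if there is a value at or before it (ffill) and
# at or after it (bfill), i.e. it lies inside the firm's active span.
filled = d.ffill(axis=1).where(d.ffill(axis=1).notna() & d.bfill(axis=1).notna())
```

With this change firm C has its zip code in every column from 1997 through 2008, while the columns outside each firm's span stay NaN.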

Original Answer:

(pd.melt(df, id_vars=['Firm', 'Zipcode'])
   .set_index(['Firm', 'value'])['Zipcode']
   .unstack(level=1)
   .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))

Output (this matches the updated solution with the fill step, run on the first example):

value   2020   2021   2019   2022   2018
Firm                                    
A      00000  00000    NaN    NaN    NaN
B      00001  00001  00001  00001    NaN
C        NaN    NaN  00003    NaN  00003
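Since the question asks about performance and memory on millions of rows, it may also be worth building the panel in long form first, one row per firm and active year, and pivoting only if the wide layout is really needed. The following is a minimal sketch using `np.repeat`; the helper name `expand_years` and the `Year` column are hypothetical, and it assumes FirstYear <= LastYear in every row.

```python
import numpy as np
import pandas as pd

def expand_years(df):
    """Hypothetical helper: one output row per firm and active year."""
    # Number of active years per firm, inclusive of both endpoints.
    n_years = (df['LastYear'] - df['FirstYear'] + 1).to_numpy()
    # Row position of the source firm for every output row.
    idx = np.repeat(np.arange(len(df)), n_years)
    # Offset 0..n-1 within each firm's block of rows.
    starts = np.cumsum(n_years) - n_years
    offsets = np.arange(n_years.sum()) - np.repeat(starts, n_years)
    return pd.DataFrame({
        'Firm': df['Firm'].to_numpy()[idx],
        'Year': df['FirstYear'].to_numpy()[idx] + offsets,
        'Zipcode': df['Zipcode'].to_numpy()[idx],
    })

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 1997],
                   'LastYear': [2021, 2022, 2008],
                   'Zipcode': ['00000', '00001', '00003']})

long_panel = expand_years(df)

# Optional: pivot to the wide layout (requires unique Firm/Year pairs).
wide = long_panel.pivot(index='Firm', columns='Year', values='Zipcode')
```

The long table only stores firm-years that are actually active, whereas the wide table holds a cell for every firm and every year that occurs, most of them empty when active spans are short relative to the overall range.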
