
Imputing Values Based on FirstYear and LastYear in Long Table Format

I have a firm-level table in long format that records each firm's first and last active year and its zip code.

import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 2018],
                   'LastYear': [2021, 2022, 2019],
                   'Zipcode': ['00000', '00001', '00003']})


Firm    FirstYear   LastYear    Zipcode
A   2020    2021    00000
B   2019    2022    00001
C   2018    2019    00003

I want panel data that has the zip code for every active year. Ideally this would be a wide table that imputes the value of Zipcode for the first year, the last year, and every year in between.

It should look like this:

Firm   2020   2021   2019   2022   2018
A      00000  00000
B      00001  00001  00001  00001
C                    00003         00003

I have some code that builds the long table row by row, but I have many millions of rows and it takes a long time. What is the best way, in terms of performance and memory use, to transform this table in pandas so that the zip code is imputed for every active year?

Thanks in advance.

In response to the answer's update: suppose there is a firm whose first and last years do not overlap with those of any other firm.

df=pd.DataFrame({'Firm':['A','B','C'],
         'FirstYear':[2020, 2019, 1997],
         'LastYear':[2021, 2022, 2008],
         'Zipcode':['00000','00001','00003']})

The output of the code then looks like this, with no columns for the years between 1997 and 2008:

Firm   2020   2021   2019   2022   1997   2008
A      00000  00000
B      00001  00001  00001  00001
C                                  00003  00003

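Here is a solution using pd.melt():

# Reshape to one row per (Firm, endpoint year), then pivot the endpoint years into columns.
d = (pd.melt(df, id_vars=['Firm', 'Zipcode'])
       .set_index(['Firm', 'value'])['Zipcode']
       .unstack(level=1))

# Forward-fill along each row, keeping only cells between a firm's first and last year.
d = (d.ffill(axis=1)
      .where(d.ffill(axis=1).notna() &
             d.bfill(axis=1).notna())
      .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))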
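
The fill above only covers years that already appear as a FirstYear or LastYear somewhere in the data, which is why the firm active from 1997 to 2008 gets no intermediate columns. One possible adjustment, sketched here rather than taken from the original answer, is to reindex the columns to every calendar year in the overall range before filling. The names `all_years` and `filled` are illustrative, and the sketch assumes each firm's FirstYear differs from its LastYear (otherwise `unstack` raises on duplicate index entries).

```python
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 1997],
                   'LastYear': [2021, 2022, 2008],
                   'Zipcode': ['00000', '00001', '00003']})

# Same reshape as above: endpoint years become columns, zip codes become values.
d = (pd.melt(df, id_vars=['Firm', 'Zipcode'])
       .set_index(['Firm', 'value'])['Zipcode']
       .unstack(level=1))

# Assumption: we want one column per calendar year between the overall
# earliest FirstYear and the overall latest LastYear.
all_years = list(range(int(df['FirstYear'].min()), int(df['LastYear'].max()) + 1))
d = d.reindex(columns=all_years)

# Keep a forward-filled value only where a backward fill also finds one,
# i.e. only inside each firm's own [FirstYear, LastYear] window.
filled = d.ffill(axis=1).where(d.ffill(axis=1).notna() & d.bfill(axis=1).notna())
```

With many firms and a wide span of years this dense wide table can become large, so it may be worth checking memory use on a sample before running it on millions of rows.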

Original Answer:

(pd.melt(df, id_vars=['Firm', 'Zipcode'])
   .set_index(['Firm', 'value'])['Zipcode']
   .unstack(level=1)
   .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))

Output:

value   2020   2021   2019   2022   2018
Firm                                    
A      00000  00000    NaN    NaN    NaN
B      00001  00001  00001  00001    NaN
C        NaN    NaN  00003    NaN  00003
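
Since the question asks about performance on millions of rows, an alternative worth considering is to build the long firm-year table in a vectorized way and pivot only if the wide form is really needed. The following is a sketch under assumptions, not part of the answer above: it repeats each row once per active year with `Index.repeat` and `numpy.repeat`, and the names `n_years`, `offsets`, `long` and `wide` are illustrative. It assumes LastYear is never earlier than FirstYear and that firm names are unique.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 1997],
                   'LastYear': [2021, 2022, 2008],
                   'Zipcode': ['00000', '00001', '00003']})

# Number of active years per firm, counting both endpoints.
n_years = (df['LastYear'] - df['FirstYear'] + 1).to_numpy()

# Repeat each firm's row once per active year (no Python-level loop).
long = df.loc[df.index.repeat(n_years), ['Firm', 'Zipcode']].reset_index(drop=True)

# Position of each repeated row within its firm's block: 0, 1, ..., n-1.
block_starts = np.repeat(n_years.cumsum() - n_years, n_years)
offsets = np.arange(n_years.sum()) - block_starts

# Year = the firm's first year plus the position within the block.
long['Year'] = np.repeat(df['FirstYear'].to_numpy(), n_years) + offsets

# Optional: pivot to the wide layout; years nobody was active in get no column.
wide = long.pivot(index='Firm', columns='Year', values='Zipcode')
```

The long table holds only the firm-years that actually exist, so it is usually the more memory-friendly form. The wide pivot stores a cell for every firm and every year column, most of which may be NaN.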
