Imputing Values Based on FirstYear and LastYear in Long Table Format
I have a firm-level long table that contains each firm's first and last active year and its zip code.
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 2018],
                   'LastYear': [2021, 2022, 2019],
                   'Zipcode': ['00000', '00001', '00003']})
Firm FirstYear LastYear Zipcode
A 2020 2021 00000
B 2019 2022 00001
C 2018 2019 00003
I want panel data that has the zip code for every active year. Ideally, I would like a wide table that imputes the value of Zipcode for the first year, the last year, and every year in between.
It should look like this:
   2020   2021   2019   2022   2018
A  00000  00000
B  00001  00001  00001  00001
C                00003         00003
I have some code that creates a long table row by row, but I have many millions of rows and it takes a long time. What is the best way, in terms of performance and memory use, to transform the table I have and impute the zip code for every year in pandas?
Thanks in advance.
Responding to the answer's update: imagine there is a firm whose first and last years do not overlap with those of the other firms.
df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 1997],
                   'LastYear': [2021, 2022, 2008],
                   'Zipcode': ['00000', '00001', '00003']})
The output from the code looks like this (the years between 1997 and 2008 are missing):
Firm  2020   2021   2019   2022   1997   2008
A     00000  00000
B     00001  00001  00001  00001
C                                 00003  00003
Here is a solution with pd.melt():
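# Pivot so that each distinct first/last year becomes a column (sorted by year)
d = (pd.melt(df, id_vars=['Firm', 'Zipcode'])
       .set_index(['Firm', 'value'])['Zipcode']
       .unstack(level=1))
# Forward-fill along each row, keep only the cells that lie between a firm's
# first and last year, then order the columns by first appearance in the input
d = (d.ffill(axis=1)
      .where(d.ffill(axis=1).notna() &
             d.bfill(axis=1).notna())
      .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))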
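To see why the ffill/bfill mask works: a cell is kept only if the same row has a value at or before that year and a value at or after it, which is the case exactly when the year lies between the firm's first and last year. Here is a minimal sketch on the question's first data set. The names seen_before, seen_after, inside and filled are only illustrative and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 2018],
                   'LastYear': [2021, 2022, 2019],
                   'Zipcode': ['00000', '00001', '00003']})

# One column per distinct year; zip codes appear only at each firm's endpoints.
d = (pd.melt(df, id_vars=['Firm', 'Zipcode'])
       .set_index(['Firm', 'value'])['Zipcode']
       .unstack(level=1))

# True where some value exists at or before / at or after the column's year.
seen_before = d.ffill(axis=1).notna()
seen_after = d.bfill(axis=1).notna()

# A year is active for a firm only if it is bracketed on both sides.
inside = seen_before & seen_after

filled = d.ffill(axis=1).where(inside)
print(filled)
```

Note that this relies on unstack() returning the year columns in sorted order, and that it can only fill years that already exist as columns, i.e. years that are some firm's first or last year.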
Original Answer:
(pd.melt(df, id_vars=['Firm', 'Zipcode'])
   .set_index(['Firm', 'value'])['Zipcode']
   .unstack(level=1)
   .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))
Output:
value   2020   2021   2019   2022   2018
Firm
A      00000  00000    NaN    NaN    NaN
B      00001  00001  00001  00001    NaN
C        NaN    NaN  00003    NaN  00003
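For the follow-up case, where a firm's active span covers years that are not any firm's first or last year (1998 to 2007 for firm C), one possible alternative is to expand each row into one row per active year with vectorized NumPy operations and pivot afterwards. This is only a sketch, not the method from the answer above. It assumes that firm names are unique, that LastYear >= FirstYear in every row, and that df has its default integer index. The names n_years, long, starts, offsets and wide are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 1997],
                   'LastYear': [2021, 2022, 2008],
                   'Zipcode': ['00000', '00001', '00003']})

# Number of active years per firm, counting both endpoints.
n_years = (df['LastYear'] - df['FirstYear'] + 1).to_numpy()

# Repeat each firm's row once per active year (no Python-level loop).
long = (df.loc[df.index.repeat(n_years), ['Firm', 'Zipcode']]
          .reset_index(drop=True))

# Offsets 0, 1, ..., n-1 within each firm's block of repeated rows.
starts = np.repeat(np.cumsum(n_years) - n_years, n_years)
offsets = np.arange(n_years.sum()) - starts
long['Year'] = np.repeat(df['FirstYear'].to_numpy(), n_years) + offsets

# Optional: wide layout with one column per year that is active for any firm.
wide = long.pivot(index='Firm', columns='Year', values='Zipcode')
print(wide)
```

With many millions of firms, the long frame is probably the more memory-friendly layout, because the wide table stores a NaN for every firm and year combination that is not active. Converting Zipcode to a categorical column (long['Zipcode'].astype('category')) may reduce memory further, though that depends on the data.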