How to reverse a multi index pivot table in python

I have a dataframe that I convert to a pivot table, impute missing data in, and then convert back to the original form. The code runs without errors, but the output does not have the expected number of rows. I suspect the problem lies in how I specify the melting/stacking, but I don't know exactly what is wrong. I would be very grateful for any help. Code and further info are below.

Thank you in advance to anyone who helps.

The initial dataframe (data) contains 4 columns (geocode/country, variablename, year and value). There are 290,038 rows x 4 columns.

I convert data into the following form (one country-year pair per row, one column per variable) using the following code:

data_temp = data.copy()

data_temp_grouped = pd.pivot_table(data_temp, index=(['geocode','year']),columns="variablename",values="value")

After performing some operations/imputation, I want to convert data_temp_grouped back to the same long form as data. I have tried a few different methods, but none of them produces the expected number of rows (290,038).

This produces 4 columns but 827,929 rows.

data_temp_grouped2 = data_temp_grouped.copy()

data_temp_grouped3 = data_temp_grouped2.stack(0).reset_index(name='value')

This produces 1,115,712 rows x 4 columns.

data_temp_grouped4 = data_temp_grouped2.copy()

data_temp_grouped4 = data_temp_grouped4.reset_index()

data_temp_grouped4 = pd.melt(data_temp_grouped4,id_vars=["geocode","year"])

data_temp_grouped4

TLDR: I failed to account for the "missing" data that the wide format introduces and that is then carried over into the long format.

I just realized why I was having these problems. The initial long format had ~290,000 rows. Converted into wide format, it has 7,748 rows x 144 columns. Melting that back into long format gives a total of 1,115,712 rows (7,748 x 144). The increase arises because missing data (country-year pairs absent for certain variables) had no row in the initial data and only "emerged" as empty cells during the conversion to wide format. Converting the melted data from long back to wide again, the dimensions match: 7,748 x 144, as expected.
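To make the row arithmetic concrete, here is a minimal sketch on made-up toy data (the frame `toy` and its values are purely illustrative, not the real dataset). It shows that melting the wide table yields rows x columns entries, and that only the non-missing ones correspond to rows of the original long data.

```python
import pandas as pd

# Hypothetical toy data: 2 countries x 2 years x 2 variables would be
# 8 combinations, but only 5 of them are actually observed.
toy = pd.DataFrame({
    "geocode": ["AA", "AA", "AA", "BB", "BB"],
    "year": [2000, 2000, 2001, 2000, 2001],
    "variablename": ["gdp", "pop", "gdp", "gdp", "pop"],
    "value": [1.0, 10.0, 2.0, 3.0, 20.0],
})

# Long -> wide: unobserved combinations become NaN cells.
wide = pd.pivot_table(toy, index=["geocode", "year"],
                      columns="variablename", values="value")

# Wide -> long: every cell becomes a row, including the NaN ones.
long_again = wide.reset_index().melt(id_vars=["geocode", "year"],
                                     var_name="variablename",
                                     value_name="value")

n_full = len(long_again)                               # rows x columns of `wide`
n_observed = len(long_again.dropna(subset=["value"]))  # rows actually observed

print(wide.shape, n_full, n_observed)
```

Before any imputation, dropping the NaN rows recovers the original row count; after imputation the filled cells are real values, so the count legitimately grows.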

For anyone else who might encounter the same problem, I've included my code below.

import pandas as pd

# `data` is the original long-format dataframe (geocode, variablename, year, value)
data_temp = data.copy()
# convert to multi-indexed wide format: one row per country-year pair, one column per variable
data_temp_grouped = pd.pivot_table(data_temp, index=['geocode', 'year'], columns='variablename', values='value')

# linearly interpolate each variable within each country
# group_keys=False keeps the original (geocode, year) index; without it, newer pandas
# versions prepend the group key as an extra index level and reset_index() below fails
data_temp_grouped = data_temp_grouped.groupby('geocode', group_keys=False).apply(lambda x: x.interpolate(method='linear', limit_direction='both'))

# make a copy of the dataframe
data_temp_grouped2 = data_temp_grouped.copy()

# reset the index so geocode and year become ordinary columns, then melt back to long format
data_temp_grouped2 = data_temp_grouped2.reset_index()
data_temp_grouped2_melted = pd.melt(data_temp_grouped2, id_vars=['geocode', 'year'], var_name='variablename', value_name='value')
data_temp_grouped2_melted

# to double check, convert back to multi-indexed wide format
data_temp_grouped_check = pd.pivot_table(data_temp_grouped2_melted, index=['geocode', 'year'], columns='variablename', values='value')
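As a possible follow-up, the melted result still contains rows for cells that interpolation could not fill, for example a variable that a country never reports. The sketch below, again on made-up toy data with hypothetical names, drops those rows after melting. The resulting count lies between the original long length and the full rows x columns grid. This may also be why the stack() attempt in the question gave an intermediate row count, since stack() has historically dropped NaN entries by default, but that is an assumption rather than something verified on the real data.

```python
import pandas as pd

# Hypothetical toy data: AA is missing gdp for 2001 (fillable by interpolation),
# BB never reports gdp at all (not fillable).
toy = pd.DataFrame({
    "geocode": ["AA", "AA", "AA", "AA", "AA", "BB", "BB"],
    "year": [2000, 2002, 2000, 2001, 2002, 2000, 2002],
    "variablename": ["gdp", "gdp", "pop", "pop", "pop", "pop", "pop"],
    "value": [1.0, 3.0, 10.0, 11.0, 12.0, 5.0, 6.0],
})

wide = pd.pivot_table(toy, index=["geocode", "year"],
                      columns="variablename", values="value")

# Interpolate within each country; group_keys=False keeps the original index.
wide = wide.groupby(level="geocode", group_keys=False).apply(
    lambda g: g.interpolate(method="linear", limit_direction="both")
)

melted = wide.reset_index().melt(id_vars=["geocode", "year"],
                                 var_name="variablename",
                                 value_name="value")

# Keep only the cells that hold a value after imputation.
melted_filled = melted.dropna(subset=["value"]).reset_index(drop=True)

print(len(toy), len(melted_filled), len(melted))
```

Whether to keep or drop the still-empty rows depends on what the downstream analysis expects; keeping them makes the gaps explicit, dropping them keeps the long table compact.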
