简体   繁体   中英

join dataframes and create a multiindex

suppose I have 2 dataframes of the same shape. In the first column of each dataframe I have a product id (parent of variant items), in the following columns I have some data (pre-processed product features / numbers), which is different in the two dataframes, and in the last column I have the total amount of variant items pro product (children to parent product id)

first dataframe:

dfaa = pd.DataFrame([['id1', 1, 2, 3, 3], ['id2',4, 5, 6,6 ], ['id3', 7, 8, 9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])

second dataframe:

dfbb = pd.DataFrame([['id1', 1.1, 2.2, 3.3, 3], ['id2',4.4, 5.4, 6.6,6 ], ['id3', 7.7, 8.8, 9.9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])

原始数据框

What I need to do is join these dataframes to form one dataframe with a multiindex like this:

具有多索引的数据帧

The first option would be an extra index level for each feature consisting of two columns on the lower level for the two values from 2 original dataframes. The second option I could think of is just concatenating the features along columns and then adding an extra index level which describes the numbers (non-NaN-values and unique values).

For the first option it could be neccessary to modify the names of the columns of the lower index level (eg instead of a and a I could work with a_vals and a_unique) - that would be no problem.

Trying real hard to get hold of working with data in python, I really appreciate your help.

Looking at one of your target structures, it can be built by stack() and unstack()

dfaa = pd.DataFrame([['id1', 1, 2, 3, 3], ['id2',4, 5, 6,6 ], ['id3', 7, 8, 9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])
dfbb = pd.DataFrame([['id1', 1.1, 2.2, 3.3, 3], ['id2',4.4, 5.4, 6.6,6 ], ['id3', 7.7, 8.8, 9.9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])

def prepdf(df, cat):
    return (df.loc[:,[c for c in dfaa.columns if "number" not in c]]
     .set_index("prod_id")
     .stack()
     .to_frame()
     .assign(cat=cat)
    )

dfm = (pd.concat([
    prepdf(dfaa, "VALS"),
    prepdf(dfbb, "V_UN")])
 .set_index("cat", append=True)
 .unstack([2,1])
 .droplevel(0, axis=1)
 .join(dfbb.loc[:,["prod_id","number of prod variants"]]
       .set_index("prod_id")
      .rename(columns={"number of prod variants":("","number of prod variants")}))
)

output

cat     VALS           V_UN                                  
           a    b    c    a    b    c number of prod variants
prod_id                                                      
id1      1.0  2.0  3.0  1.1  2.2  3.3                       3
id2      4.0  5.0  6.0  4.4  5.4  6.6                       6
id3      7.0  8.0  9.0  7.7  8.8  9.9                       9

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM