Suppose I have two dataframes of the same shape. The first column of each holds a product id (the parent of the variant items), the following columns hold some data (pre-processed product features / numbers) that differs between the two dataframes, and the last column holds the total number of variant items per product (children of the parent product id).
first dataframe:
dfaa = pd.DataFrame([['id1', 1, 2, 3, 3], ['id2',4, 5, 6,6 ], ['id3', 7, 8, 9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])
second dataframe:
dfbb = pd.DataFrame([['id1', 1.1, 2.2, 3.3, 3], ['id2',4.4, 5.4, 6.6,6 ], ['id3', 7.7, 8.8, 9.9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])
What I need to do is join these dataframes into one dataframe with a MultiIndex on the columns. I can think of two layouts:
The first option would be an extra index level for each feature, with two columns on the lower level holding the two values from the two original dataframes. The second option I could think of is just concatenating the features along the columns and then adding an extra index level that describes the numbers (non-NaN values and unique values).
For the first option it could be necessary to modify the names of the columns on the lower index level (e.g. instead of a and a I could work with a_vals and a_unique); that would be no problem.
Trying real hard to get hold of working with data in python, I really appreciate your help.
Looking at one of your target structures, it can be built with stack() and unstack():
import pandas as pd

dfaa = pd.DataFrame([['id1', 1, 2, 3, 3], ['id2', 4, 5, 6, 6], ['id3', 7, 8, 9, 9]], columns=['prod_id', 'a', 'b', 'c', 'number of prod variants'])
dfbb = pd.DataFrame([['id1', 1.1, 2.2, 3.3, 3], ['id2', 4.4, 5.4, 6.6, 6], ['id3', 7.7, 8.8, 9.9, 9]], columns=['prod_id', 'a', 'b', 'c', 'number of prod variants'])

def prepdf(df, cat):
    # Drop the count column, move the features into the index (long format)
    # and tag every row with the category label of its source dataframe.
    return (df.loc[:, [c for c in df.columns if "number" not in c]]
            .set_index("prod_id")
            .stack()
            .to_frame()
            .assign(cat=cat)
            )

dfm = (pd.concat([
            prepdf(dfaa, "VALS"),
            prepdf(dfbb, "V_UN")])
       .set_index("cat", append=True)
       # index levels are now (prod_id, feature, cat); move cat and feature to the columns
       .unstack([2, 1])
       .droplevel(0, axis=1)
       # add the count column back under an empty top-level label
       .join(dfbb.loc[:, ["prod_id", "number of prod variants"]]
             .set_index("prod_id")
             .rename(columns={"number of prod variants": ("", "number of prod variants")}))
       )
cat     VALS           V_UN
           a    b    c    a    b    c number of prod variants
prod_id
id1      1.0  2.0  3.0  1.1  2.2  3.3                       3
id2      4.0  5.0  6.0  4.4  5.4  6.6                       6
id3      7.0  8.0  9.0  7.7  8.8  9.9                       9
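For comparison, the second layout from the question (concatenate along the columns, then add an outer level) can probably also be reached with the keys argument of pd.concat. This is only a sketch under assumptions: the labels "VALS" and "V_UN" are reused from the code above, and the helper names count_col, feats, wide and by_feature are made up for illustration. It also assumes both dataframes contain the same prod_id values.

```python
import pandas as pd

dfaa = pd.DataFrame([['id1', 1, 2, 3, 3], ['id2', 4, 5, 6, 6], ['id3', 7, 8, 9, 9]],
                    columns=['prod_id', 'a', 'b', 'c', 'number of prod variants'])
dfbb = pd.DataFrame([['id1', 1.1, 2.2, 3.3, 3], ['id2', 4.4, 5.4, 6.6, 6], ['id3', 7.7, 8.8, 9.9, 9]],
                    columns=['prod_id', 'a', 'b', 'c', 'number of prod variants'])

count_col = 'number of prod variants'  # hypothetical helper name

# Index both frames by product id and keep only the feature columns.
feats_a = dfaa.set_index('prod_id').drop(columns=count_col)
feats_b = dfbb.set_index('prod_id').drop(columns=count_col)

# keys= adds an outer column level naming the source of each block.
feats = pd.concat([feats_a, feats_b], axis=1, keys=['VALS', 'V_UN'])

# Second layout: source label on top, plus the count column under an empty top label.
wide = feats.copy()
wide[('', count_col)] = dfaa.set_index('prod_id')[count_col]

# First layout: feature name on top, the two sources side by side underneath.
by_feature = feats.swaplevel(axis=1).sort_index(axis=1)

print(wide)
print(by_feature)
```

Unlike the stack()/unstack() route, this keeps the original dtypes per column (the VALS columns stay integers), which may or may not matter for your use case.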