I'd like to aggregate rows in certain column base on the relationship with other column and create certain column which contain aggregated data in json format.
This is the example.
Original data table
Child Name Child Age Father Name Father Age
Peter 5 Richard 40
James 15 Doug 45
Liz 2 Doug 45
Paul 6 Richard 40
Shirly 11 Charles 33
Eva 9 Chris 29
Converted Data table will be either
Father Name Father Age Children
Richard 40 {"Peter":"5", "Paul":"6"}
Doug 45 {"James":"15","Liz":"2"}
Charles 33 {"Shirly" : "11"}
Chris 29 {"Eva" : "9"}
Or
Father Name Father Age Children Name Children Age
Richard 40 {"Peter", "Paul"} {"5","6"}
Doug 45 {"James", "Liz"} {"15","2"}
Charles 33 {"Shirly"} {"11"}
Chris 29 {"Eva"} {"9"}
My code is
import pandas as pd
df = pd.DataFrame({
"Child Name" : ["Peter","James","Liz","Paul","Shirly","Eva"],
"Child Age" : ["5","15","2","6","11","9"],
"Father Name" : ["Richard","Doug","Doug","Richard","Charles","Chris"],
"Father Age" : ["40","45","45","40","33","29"] })
print df
g1 = df.groupby(["Father Name"])["Child Name"].apply(", ".join).reset_index()
g1.columns = ['Father Name','Children Name']
print g1
and the output will be
Father Name Children Name
0 Charles Shirly
1 Chris Eva
2 Doug James, Liz
3 Richard Peter, Paul
I can't figure out how to add "Father Age" and "Children Age" in the columns. how can I convert this in dataframe in most efficient way? I'd like to avoid loop via python as it will take long to process.
thanks,
Quick dirty inefficient hack, but it avoids for loops. Would love to have a better solution; I assume the multiple df copies and multiple merges could be simplified.
import pandas as pd
df = pd.DataFrame({
"Child Name" : ["Peter","James","Liz","Paul","Shirly","Eva"],
"Child Age" : ["5","15","2","6","11","9"],
"Father Name" : ["Richard","Doug","Doug","Richard","Charles","Chris"],
"Father Age" : ["40","45","45","40","33","29"] })
g2 = df.groupby(['Father Name'])["Child Name"].apply(list).reset_index()
g3 = df.groupby(['Father Name'])["Child Age"].apply(list).reset_index()
g4 = df[["Father Name", "Father Age"]].drop_duplicates()
df2 = g2.merge(g4)
df2 = df2.merge(g3)
print(df2)
Output:
Father Name Child Name Father Age Child Age
0 Charles [Shirly] 33 [11]
1 Chris [Eva] 29 [9]
2 Doug [James, Liz] 45 [15, 2]
3 Richard [Peter, Paul] 40 [5, 6]
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.