
How to combine rows into a single row in a DataFrame with SQL or Python

I'd like to aggregate the rows of certain columns based on their relationship to another column, and create a column that holds the aggregated data in JSON format.

Here is an example.

Original data table

Child Name     Child Age    Father Name    Father Age
     Peter             5        Richard            40
     James            15           Doug            45
       Liz             2           Doug            45
      Paul             6        Richard            40
    Shirly            11        Charles            33
       Eva             9          Chris            29

The converted data table would be either

Father Name    Father Age     Children 
    Richard            40     {"Peter":"5", "Paul":"6"}
       Doug            45     {"James":"15","Liz":"2"}
    Charles            33     {"Shirly" : "11"}
      Chris            29     {"Eva" : "9"}

Or

Father Name    Father Age     Children Name       Children Age
    Richard            40     {"Peter", "Paul"}      {"5","6"}
       Doug            45     {"James", "Liz"}      {"15","2"}
    Charles            33     {"Shirly"}                {"11"}
      Chris            29     {"Eva"}                    {"9"}

My code is

import pandas as pd
df = pd.DataFrame({
    "Child Name" : ["Peter","James","Liz","Paul","Shirly","Eva"],
    "Child Age" : ["5","15","2","6","11","9"],
    "Father Name" : ["Richard","Doug","Doug","Richard","Charles","Chris"],
    "Father Age" : ["40","45","45","40","33","29"] })

print(df)

# Join each father's child names into one comma-separated string
g1 = df.groupby(["Father Name"])["Child Name"].apply(", ".join).reset_index()
g1.columns = ['Father Name','Children Name']
print(g1)

and the output will be

  Father Name   Children Name
0     Charles          Shirly
1       Chris             Eva
2        Doug      James, Liz
3     Richard     Peter, Paul

I can't figure out how to add the "Father Age" and "Children Age" columns. How can I do this conversion on a DataFrame in the most efficient way? I'd like to avoid an explicit Python loop, as it would take too long to process.

thanks,

A quick, dirty, inefficient hack, but it avoids for loops. I would love to see a better solution; I assume the multiple DataFrame copies and merges could be simplified.

import pandas as pd
df = pd.DataFrame({
    "Child Name" : ["Peter","James","Liz","Paul","Shirly","Eva"],
    "Child Age" : ["5","15","2","6","11","9"],
    "Father Name" : ["Richard","Doug","Doug","Richard","Charles","Chris"],
    "Father Age" : ["40","45","45","40","33","29"] })

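# Collect each father's child names and child ages into lists
g2 = df.groupby(['Father Name'])["Child Name"].apply(list).reset_index()
g3 = df.groupby(['Father Name'])["Child Age"].apply(list).reset_index()
# One row per father with his age
g4 = df[["Father Name", "Father Age"]].drop_duplicates()

# Both merges join on the only shared column, "Father Name"
df2 = g2.merge(g4)
df2 = df2.merge(g3)
print(df2)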

Output:

  Father Name     Child Name Father Age Child Age
0     Charles       [Shirly]         33      [11]
1       Chris          [Eva]         29       [9]
2        Doug   [James, Liz]         45   [15, 2]
3     Richard  [Peter, Paul]         40    [5, 6]
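
The copies and merges above can probably be collapsed into a single groupby. The following is only a sketch, assuming pandas 0.25 or later (for named aggregation), that "Father Age" is the same on every row for a given father (so it can be used as a grouping key), and that child names are unique per father (otherwise the dict would silently drop duplicates). The variable names keys and out are my own. The "Children" column is built with a comprehension that runs once per father rather than once per child row, so it is not fully loop-free.

```python
import json
import pandas as pd

df = pd.DataFrame({
    "Child Name": ["Peter", "James", "Liz", "Paul", "Shirly", "Eva"],
    "Child Age": ["5", "15", "2", "6", "11", "9"],
    "Father Name": ["Richard", "Doug", "Doug", "Richard", "Charles", "Chris"],
    "Father Age": ["40", "45", "45", "40", "33", "29"],
})

# Grouping on both father columns keeps "Father Age" without a merge.
# Assumption: each father has exactly one age in the data.
keys = ["Father Name", "Father Age"]

# One groupby, two list-valued columns (the second layout in the question).
# sort=False keeps fathers in order of first appearance.
out = df.groupby(keys, sort=False).agg(
    **{
        "Children Name": ("Child Name", list),
        "Children Age": ("Child Age", list),
    }
).reset_index()

# The first layout in the question: one JSON object per father,
# mapping child name to child age. This iterates once per group.
out["Children"] = [
    json.dumps(dict(zip(names, ages)))
    for names, ages in zip(out["Children Name"], out["Children Age"])
]

print(out)
```

If only the JSON column is wanted, the two list columns can be dropped afterwards with out.drop(columns=["Children Name", "Children Age"]).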
