
How to split single column of pandas dataframe into multiple columns with group?

I am new to python pandas. I have one dataframe like below:

import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
               'age': ['25', '22','21','32','37','26','24','30']})
print(df)

       Name age
0  football  25
1    ramesh  22
2    suresh  21
3    pankaj  32
4   cricket  37
5    rakesh  26
6     mohit  24
7    mahesh  30

The "Name" column contains both sport names and the names of the people who play them. I want to split it into two separate columns, like below:

Expected Output:

sports_name sport_person_name age
football    ramesh            22
            suresh            21
            pankaj            32
cricket     rakesh            26
            mohit             24
            mahesh            30

A groupby on the "Name" column does not give the expected output, which makes sense because the "Name" column has no duplicates. What should I use to get the expected output?

Edit: If you don't want to hardcode the sport names:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
           'age': ['', '22','21','32','','26','24','30']})

#turn empty strings into NaN (no regex: an empty pattern would match every string)
df = df.replace('', np.nan)

#rows that contain a NaN are the sport rows
nan_rows = df[df.isnull().any(axis=1)]
sports = nan_rows['Name'].tolist()

df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)

For every column except "Name", I checked which rows contain only NaN values; those rows must be the sport names. I built a list of those sport names and used the solutions below to create the sports_name and sport_person_name columns.
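The same idea can be sketched without the intermediate list, by using the blank ages directly as a boolean mask. This is only a sketch under the assumption that a row is a sport row exactly when its age is an empty string; the name `is_sport` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj',
                            'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['', '22', '21', '32', '', '26', '24', '30']})

# assumption: a blank age marks a sport row
is_sport = df['age'].eq('')

# keep Name only on sport rows, then carry it down to the players below
df['sports_name'] = df['Name'].where(is_sport).ffill()

# drop the sport rows, rename and reorder the columns
df = (df[~is_sport]
      .rename(columns={'Name': 'sport_person_name'})
      .reset_index(drop=True))
df = df[['sports_name', 'sport_person_name', 'age']]
print(df)
```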

You can use:

#define list of sports
sports = ['football','cricket']
#create NaNs if no sport in Name, forward filling NaNs
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
#remove same values in columns sports_name and Name, rename column
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
#change order of columns
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30
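To see why the where/ffill step works, it may help to look at the intermediate Series side by side. A minimal sketch on the question's sample data; the names `marked` and `filled` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj',
                            'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
sports = ['football', 'cricket']

# keep the value only where Name is a sport, NaN everywhere else
marked = df['Name'].where(df['Name'].isin(sports))

# carry each sport name down over the NaN rows below it
filled = marked.ffill()

print(pd.DataFrame({'Name': df['Name'], 'marked': marked, 'filled': filled}))
```

After this step, the rows where `filled` equals `Name` are the sport rows themselves, which is why they can be filtered out with a simple inequality.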

A similar solution with DataFrame.insert, after which reordering the columns is not necessary:

#define list of sports
sports = ['football','cricket']
#rename column by dict
d = {'Name':'sport_person_name'}
df = df.rename(columns=d)
#create NaNs if no sport in Name, forward filling NaNs
df.insert(0, 'sports_name', df['sport_person_name'].where(df['sport_person_name'].isin(sports)).ffill())
#remove same values in columns sports_name and Name
df = df[df['sports_name'] != df['sport_person_name']].reset_index(drop=True)
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30

If you want each sport shown only once, add limit=1 to ffill and replace the NaNs with empty strings:

sports = ['football','cricket']
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill(limit=1).fillna('')
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1                        suresh  21
2                        pankaj  32
3     cricket            rakesh  26
4                         mohit  24
5                        mahesh  30

The output you want is really a dictionary rather than a dataframe. The dictionary will look like:

{'Sport' : {'Player' : age,'Player2' : age}}
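A dictionary of that shape could be built with a plain loop. This is a rough sketch, assuming the list of sports is known up front and each sport row comes before its players; the name `result` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj',
                            'cricket', 'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
sports = ['football', 'cricket']

result = {}
current_sport = None
for name, age in zip(df['Name'], df['age']):
    if name in sports:
        # a sport row starts a new inner dictionary
        current_sport = name
        result[current_sport] = {}
    elif current_sport is not None:
        # a player row is stored under the most recent sport
        result[current_sport][name] = age

print(result)
```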

If you really want a dataframe, and the sport name always comes before its players:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['football','ramesh','suresh','pankaj','cricket'
                  ,'rakesh','mohit','mahesh'],
                  'age': ['25', '22','21','32','37','26','24','30']})
#the rest of the code works on the renamed column
df = df.rename(columns={'Name': 'sport_person_name'})

sports=['football', 'cricket']
wanted_dict={}
current_sport=''

for val in df['sport_person_name']:
    if val in sports:
        current_sport=val
    else:
        wanted_dict[val]=current_sport

#Now you got - {name:sport_name,...}

#players get their sport from the dict, sport rows get the marker 'sport'
df['sports_name']=np.where(df['sport_person_name'].isin(sports),
                           'sport',
                           df['sport_person_name'].map(wanted_dict))

#drop the marker rows and put the columns in the wanted order
df = df[df['sports_name']!='sport'].reset_index(drop=True)
df = df[['sports_name','sport_person_name','age']]

What it should look like:

sports_name sport_person_name age
football    ramesh            22
football    suresh            21
football    pankaj            32
cricket     rakesh            26
cricket     mohit             24
cricket     mahesh            30
