Expanding a column of pandas dataframe with variable number of elements and leading texts

I am trying to expand a column of a pandas DataFrame (see the segments column in the example below). I am able to break it into its components, which are separated by ';'. However, as you can see, some rows do not have all the elements. As a result, values end up in the wrong column whenever an element is missing: data that should go into the Geo column ends up in the BusSeg column, or data that should be in the ProdServ column ends up in the Geo column. Ideally, each cell would hold only the value, not the indicator, and sit in the correct column. So the Geo column should say 'NonUs', not 'Geo=NonUs'. That is, after separating correctly, I would like to remove the text up to and including the '=' sign in each cell. How can I do this? Code below:

import pandas as pd

company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
                    'Subseg=Tr1',
                    'BusSeg=Pharma',
                    'Geo=China;Prd=Alpha;Subseg=Tr4;',
                    'Prd=Beta;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
                    'BusSeg=Pharma;Geo=NonUs;']
print("\ndf1:")
df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
print(df1)
print(df1[['BusSeg','Geo','ProdServ','Sub','Misc']])
print(df1.dtypes)
print()

Your Data

import pandas as pd

company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
                    'Subseg=Tr1',
                    'BusSeg=Pharma',
                    'Geo=China;Prd=Alpha;Subseg=Tr4;',
                    'Prd=Beta;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
                    'BusSeg=Pharma;Geo=NonUs;']

df:


    company     clv     date    line    segments
0   Rev     500     20191231    1   BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1
1   Rev     200     20191231    3   BusSeg=Dev;Prd=Alpha;Subseg=Tr1
2   Rev     3000    20191231    2   BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2
3   Rev     400     20181231    1   Subseg=Tr1
4   Rev     10      20181231    3   BusSeg=Pharma
5   Rev     300     20181231    2   Geo=China;Prd=Alpha;Subseg=Tr4;
6   Rev     560     20171231    1   Prd=Beta;Subseg=Tr1
7   Rev     500     20171231    3   BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;
8   Rev     600     20171231    2   BusSeg=Pharma;Geo=NonUs;

Comment out the line df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True) in your code, and add these two lines:

d = pd.DataFrame(df1['segments'].str.split(';').apply(lambda x:{i.split("=")[0] : i.split("=")[1] for i in x if i}).to_dict()).T
df = pd.concat([df1, d], axis=1)

df:

  company   clv      date  line                                      segments  BusSeg    Geo    Prd Subseg
0     Rev   500  20191231     1  BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1  Pharma  NonUs  Alpha    Tr1
1     Rev   200  20191231     3               BusSeg=Dev;Prd=Alpha;Subseg=Tr1     Dev    NaN  Alpha    Tr1
2     Rev  3000  20191231     2     BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2  Pharma     US  Alpha    Tr2
3     Rev   400  20181231     1                                    Subseg=Tr1     NaN    NaN    NaN    Tr1
4     Rev    10  20181231     3                                 BusSeg=Pharma  Pharma    NaN    NaN    NaN
5     Rev   300  20181231     2               Geo=China;Prd=Alpha;Subseg=Tr4;     NaN  China  Alpha    Tr4
6     Rev   560  20171231     1                           Prd=Beta;Subseg=Tr1     NaN    NaN   Beta    Tr1
7     Rev   500  20171231     3    BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;  Pharma     US  Delta    Tr1
8     Rev   600  20171231     2                      BusSeg=Pharma;Geo=NonUs;  Pharma  NonUs    NaN    NaN
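The one-liner above does three things at once: it splits on ';', turns each 'key=value' piece into a dictionary entry, and builds a DataFrame from those dictionaries. Here is a step-by-step sketch of the same idea on a small sample. The helper name parse_segments and the final rename to the question's column names (ProdServ, Sub) are my own additions, not part of the answer above.

```python
import pandas as pd


def parse_segments(text):
    """Turn 'BusSeg=Pharma;Geo=US;' into {'BusSeg': 'Pharma', 'Geo': 'US'}."""
    # Skip the empty piece left by a trailing ';' and split each item once on '='.
    pairs = (item.split("=", 1) for item in text.split(";") if item)
    return {key: value for key, value in pairs}


# Small sample in the same format as the question's 'segments' column.
sample = pd.DataFrame({
    "segments": [
        "BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1",
        "Subseg=Tr1",
        "Geo=China;Prd=Alpha;Subseg=Tr4;",
    ]
})

# One dict per row; keys missing from a row become NaN in the resulting frame.
parsed = pd.DataFrame(sample["segments"].map(parse_segments).tolist(),
                      index=sample.index)

# Optional: rename the keys to the column names used in the question.
parsed = parsed.rename(columns={"Prd": "ProdServ", "Subseg": "Sub"})

result = pd.concat([sample, parsed], axis=1)
print(result)
```

Because the columns come from the keys rather than from position, a row that lacks an element simply gets NaN in that column instead of shifting the remaining values to the left.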

I suggest filling the columns one by one instead of using split, something like the following code:

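col = ['BusSeg', 'Geo', 'ProdServ', 'Sub']  # New column names.
var = ['BusSeg', 'Geo', 'Prd', 'Subseg']    # Key names used in the 'segments' column.
for c, v in zip(col, var):
    # The key must sit at the start of the string or right after ';'.
    # The value runs up to the next ';' or the end of the string, so rows without a trailing ';' also match.
    df1[c] = df1['segments'].str.extract(rf'(?:^|;){v}=([^;]*)', expand=False)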
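One thing to watch with the extract approach: several rows in the question end without a semicolon, so the pattern has to accept a value that is followed by either ';' or the end of the string. Below is a self-contained sketch on a small sample. The dictionary name col_to_key and the use of re.escape are my own choices for illustration.

```python
import re

import pandas as pd

sample = pd.DataFrame({
    "segments": [
        "BusSeg=Dev;Prd=Alpha;Subseg=Tr1",   # no trailing ';'
        "Geo=China;Prd=Alpha;Subseg=Tr4;",   # trailing ';'
        "BusSeg=Pharma",                     # single element
    ]
})

# Target column name -> key used inside the 'segments' strings.
col_to_key = {"BusSeg": "BusSeg", "Geo": "Geo", "ProdServ": "Prd", "Sub": "Subseg"}

for column, key in col_to_key.items():
    # Key at the start or after ';', then capture everything up to the next ';'.
    pattern = rf"(?:^|;){re.escape(key)}=([^;]*)"
    # expand=False returns a Series for a single capture group.
    sample[column] = sample["segments"].str.extract(pattern, expand=False)

print(sample[["BusSeg", "Geo", "ProdServ", "Sub"]])
```

Rows where the key is absent get NaN. This approach needs the list of keys up front, which is fine when they are known and fixed, as in the question.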

Here's a suggestion:

df1.segments = (df1.segments.str.split(';')
                   .apply(lambda s:
                          dict(t.split('=') for t in s if t.strip() != '')))
df2 = pd.DataFrame({col: [dict_.get(col, '') for dict_ in df1.segments]
                    for col in set().union(*df1.segments)},
                   index=df1.index)
df1.drop(columns=['segments'], inplace=True)
df1 = pd.concat([df1, df2], axis='columns')

Result:

  company   clv      date  line Subseg    Geo  BusSeg    Prd
0     Rev   500  20191231     1    Tr1  NonUs  Pharma  Alpha
1     Rev   200  20191231     3    Tr1            Dev  Alpha
2     Rev  3000  20191231     2    Tr2     US  Pharma  Alpha
3     Rev   400  20181231     1    Tr1                      
4     Rev    10  20181231     3                Pharma       
5     Rev   300  20181231     2    Tr4  China          Alpha
6     Rev   560  20171231     1    Tr1                  Beta
7     Rev   500  20171231     3    Tr1     US  Pharma  Delta
8     Rev   600  20171231     2         NonUs  Pharma       
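Note that the column order in the result above comes from a set, so it can differ between runs. If a stable order matters, one alternative (a different technique from the code above, shown only as a sketch) is to pull out all key/value pairs with str.extractall and unstack them, which yields the keys in sorted order. The group names key and value are arbitrary names chosen for this example, and the sketch assumes no key is repeated within a row.

```python
import pandas as pd

sample = pd.DataFrame({
    "segments": [
        "BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1",
        "BusSeg=Pharma",
        "Prd=Beta;Subseg=Tr1;",
    ]
})

# One row per 'key=value' match; the index is (original row, match number).
pairs = sample["segments"].str.extractall(r"(?P<key>[^;=]+)=(?P<value>[^;]*)")

# Drop the match number, index by (original row, key), then spread keys into columns.
wide = (pairs.reset_index(level="match", drop=True)
             .set_index("key", append=True)["value"]
             .unstack("key"))

# Rows with no matches at all would be missing from 'wide', so align to the original index.
wide = wide.reindex(sample.index)

result = sample.join(wide)
print(result)
```

Missing elements are NaN here; calling wide.fillna('') before the join would give empty strings instead, as in the result shown above.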
