
Expanding a column of a pandas DataFrame with a variable number of elements and leading text

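I am trying to expand a column of a pandas DataFrame (see the segments column in the example below). I can break it into its components, which are separated by ";". However, as you can see, some rows do not contain all the elements. As a result, data that should go into the Geo column ends up in the BusSeg column, or data that should be in the ProdServ column ends up in the Geo column, because an earlier element is missing from that row. Ideally, each cell should hold only the data, correctly placed, without the indicator. So the Geo column should say "NonUs", not "Geo=NonUs". In other words, after separating the elements correctly, I would like to remove the leading text up to and including the "=" sign in each one. How can I do this? Code below: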

import pandas as pd

company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
                    'Subseg=Tr1',
                    'BusSeg=Pharma',
                    'Geo=China;Prd=Alpha;Subseg=Tr4;',
                    'Prd=Beta;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
                    'BusSeg=Pharma;Geo=NonUs;']
print("\ndf1:")
df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
print(df1)
print(df1[['BusSeg','Geo','ProdServ','Sub','Misc']])
print(df1.dtypes)
print()
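To make the misalignment concrete, here is a minimal sketch on two of the rows above. It only illustrates the problem: `str.split` with `expand=True` assigns pieces by position, so a row that starts with `Geo=...` puts that piece in the first column. The variable names `s`, `parts`, and `first_cells` are illustrative.

```python
import pandas as pd

# Two sample rows: the second one has no BusSeg entry.
s = pd.Series(['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
               'Geo=China;Prd=Alpha;Subseg=Tr4;'])

# Positional split: piece i always lands in column i,
# regardless of the key in front of the "=" sign.
parts = s.str.split(';', expand=True)
print(parts)

# The first column now mixes BusSeg and Geo entries.
first_cells = parts[0].tolist()
print(first_cells)
```

The first column holds whatever piece comes first in each row, which is why the values need to be placed by key rather than by position.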

Your data:

import pandas as pd

company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
                    'Subseg=Tr1',
                    'BusSeg=Pharma',
                    'Geo=China;Prd=Alpha;Subseg=Tr4;',
                    'Prd=Beta;Subseg=Tr1',
                    'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
                    'BusSeg=Pharma;Geo=NonUs;']

df1:


    company     clv     date    line    segments
0   Rev     500     20191231    1   BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1
1   Rev     200     20191231    3   BusSeg=Dev;Prd=Alpha;Subseg=Tr1
2   Rev     3000    20191231    2   BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2
3   Rev     400     20181231    1   Subseg=Tr1
4   Rev     10      20181231    3   BusSeg=Pharma
5   Rev     300     20181231    2   Geo=China;Prd=Alpha;Subseg=Tr4;
6   Rev     560     20171231    1   Prd=Beta;Subseg=Tr1
7   Rev     500     20171231    3   BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;
8   Rev     600     20171231    2   BusSeg=Pharma;Geo=NonUs;

Comment out this line in your code: df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True), and add these two lines:

d = pd.DataFrame(df1['segments'].str.split(';').apply(lambda x:{i.split("=")[0] : i.split("=")[1] for i in x if i}).to_dict()).T
df = pd.concat([df1, d], axis=1)

df:

  company   clv      date  line                                      segments  BusSeg    Geo    Prd Subseg
0     Rev   500  20191231     1  BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1  Pharma  NonUs  Alpha    Tr1
1     Rev   200  20191231     3               BusSeg=Dev;Prd=Alpha;Subseg=Tr1     Dev    NaN  Alpha    Tr1
2     Rev  3000  20191231     2     BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2  Pharma     US  Alpha    Tr2
3     Rev   400  20181231     1                                    Subseg=Tr1     NaN    NaN    NaN    Tr1
4     Rev    10  20181231     3                                 BusSeg=Pharma  Pharma    NaN    NaN    NaN
5     Rev   300  20181231     2               Geo=China;Prd=Alpha;Subseg=Tr4;     NaN  China  Alpha    Tr4
6     Rev   560  20171231     1                           Prd=Beta;Subseg=Tr1     NaN    NaN   Beta    Tr1
7     Rev   500  20171231     3    BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;  Pharma     US  Delta    Tr1
8     Rev   600  20171231     2                      BusSeg=Pharma;Geo=NonUs;  Pharma  NonUs    NaN    NaN
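The same idea can be written as a small helper, which may be easier to read and test. This is a sketch, not the answerer's code: `parse_segments` is a hypothetical name, it assumes every non-empty piece contains an "=" sign, and the rename to `ProdServ` and `Sub` is taken from the column names used in the question.

```python
import pandas as pd

def parse_segments(text):
    """Turn 'A=1;B=2;' into {'A': '1', 'B': '2'}, skipping empty pieces.

    Assumes every non-empty piece has the form key=value.
    """
    pairs = (item.split('=', 1) for item in text.split(';') if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}

# A few sample rows taken from the data above.
df1 = pd.DataFrame({'segments': ['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
                                 'Subseg=Tr1',
                                 'Geo=China;Prd=Alpha;Subseg=Tr4;']})

# One dict per row; keys missing from a row become NaN.
parsed = pd.DataFrame(df1['segments'].map(parse_segments).tolist(),
                      index=df1.index)

# Optional: use the column names from the question.
parsed = parsed.rename(columns={'Prd': 'ProdServ', 'Subseg': 'Sub'})

df = pd.concat([df1, parsed], axis=1)
print(df)
```

Splitting on the first "=" only (`split('=', 1)`) keeps values that themselves contain an "=" sign intact.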

I suggest filling the columns one by one instead of using split, along the lines of the following code:

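col = ['BusSeg', 'Geo', 'ProdServ', 'Sub']  # Target column names.
var = ['BusSeg', 'Geo', 'Prd', 'Subseg']    # Key names used in the 'segments' column.
for c, v in zip(col, var):
    # Match "key=value" at the start of the string or after ';'. No trailing ';'
    # is required, so the last pair in a string is captured too.
    df1[c] = df1['segments'].str.extract(rf'(?:^|;){v}=(\w+)', expand=False)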
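One detail worth checking with a regex approach is whether the pattern requires a trailing ";", since the last pair in a string often has none. The sketch below compares two patterns on sample values; the names `strict` and `loose` are illustrative.

```python
import pandas as pd

s = pd.Series(['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
               'BusSeg=Pharma',
               'Prd=Beta;Subseg=Tr1;'])

# Pattern that demands a ';' after the value: it misses a value
# that sits at the very end of the string.
strict = s.str.extract(r'Subseg=(\w*);', expand=False)

# Pattern anchored on the start of the string or a preceding ';',
# with no requirement on what follows the value.
loose = s.str.extract(r'(?:^|;)Subseg=(\w+)', expand=False)

print(pd.DataFrame({'strict': strict, 'loose': loose}))
```

With `expand=False` and a single capture group, `str.extract` returns a Series, which can be assigned directly to a new column.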

Here's a suggestion:

df1.segments = (df1.segments.str.split(';')
                   .apply(lambda s:
                          dict(t.split('=') for t in s if t.strip() != '')))
df2 = pd.DataFrame({col: [dict_.get(col, '') for dict_ in df1.segments]
                    for col in set().union(*df1.segments)},
                   index=df1.index)
df1.drop(columns=['segments'], inplace=True)
df1 = pd.concat([df1, df2], axis='columns')

Result:

  company   clv      date  line Subseg    Geo  BusSeg    Prd
0     Rev   500  20191231     1    Tr1  NonUs  Pharma  Alpha
1     Rev   200  20191231     3    Tr1            Dev  Alpha
2     Rev  3000  20191231     2    Tr2     US  Pharma  Alpha
3     Rev   400  20181231     1    Tr1                      
4     Rev    10  20181231     3                Pharma       
5     Rev   300  20181231     2    Tr4  China          Alpha
6     Rev   560  20171231     1    Tr1                  Beta
7     Rev   500  20171231     3    Tr1     US  Pharma  Delta
8     Rev   600  20171231     2         NonUs  Pharma       
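Because the column names above are collected in a `set`, their order can differ between runs. If a stable order matters, one option is to collect the keys in first-seen order instead. The following is a sketch under that assumption, using a few of the sample rows; `records` and `columns` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'segments': ['Subseg=Tr1',
                                 'BusSeg=Pharma;Geo=NonUs;',
                                 'Geo=China;Subseg=Tr4;']})

# One dict per row, skipping the empty piece left by a trailing ';'.
records = [dict(t.split('=', 1) for t in text.split(';') if t.strip())
           for text in df1['segments']]

# dict.fromkeys keeps insertion order, so keys appear in first-seen order.
columns = list(dict.fromkeys(key for rec in records for key in rec))

# Missing keys become NaN; replace them with '' to mirror the output above.
df2 = pd.DataFrame(records, index=df1.index, columns=columns).fillna('')

result = pd.concat([df1.drop(columns=['segments']), df2], axis='columns')
print(result)
```

Passing an explicit list to `columns` also works if a fixed order such as BusSeg, Geo, Prd, Subseg is preferred.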
