
Aggregating cells/columns in a pandas dataframe

I have a dataframe like this:

Index Z1       Z2       Z3       Z4  
 0    A(Z1W1)  A(Z2W1)  A(Z3W1) B(Z4W2)   
 1    A(Z1W3)  B(Z2W1)  A(Z3W2) B(Z4W3)   
 2    B(Z1W1)           A(Z3W4) B(Z4W4)
 3    B(Z1W2)

I want to convert it to:

Index   Z1              Z2        Z3                    Z4
 0      A(Z1W1,Z1W3)    A(Z2W1)   A(Z3W1,Z3W2,Z3W4)     B(Z4W2,Z4W3,Z4W4)    
 1      B(Z1W1,Z1W2)    B(Z2W1)     

Basically, I want to aggregate the values of different cells into one cell, as shown above.

Edit 1

The actual column names are two- or three-word names, not A and B; for example, Nut Butter instead of A.

Things are getting interesting :-)

s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)
v=('('+s.groupby([s.index.get_level_values(1),s[0]])[1].apply(','.join)+')').unstack().apply(lambda x : x.name+x.astype(str)).T
v[~v.apply(lambda x : x.str.contains('None'))].apply(lambda x : sorted(x,key=pd.isnull)).reset_index(drop=True)
Out[1865]: 
             Z1       Z2                 Z3                 Z4
0  A(Z1W1,Z1W3)  A(Z2W1)  A(Z3W1,Z3W2,Z3W4)  B(Z4W2,Z4W3,Z4W4)
1  B(Z1W1,Z1W2)  B(Z2W1)                NaN                NaN

Update: change

#s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)

to

s=df.stack().str.split('(',expand=True)
s[1]=s[1].replace({'[(|)]':' '},regex=True).str.strip()

General idea:

  1. split the string values
  2. regroup and join the strings
  3. apply to all columns

Update 1

# I had to add the parameter as_index=False to groupby(0)
# to get exactly the same output as asked

Let's try one column:

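def str_regroup(s):
    # Split "A(Z1W1)" into the label (column 0) and the value (column 1),
    # then join all values that share a label.
    return s.str.extract(r"(\w)\((.+)\)", expand=True).groupby(0, as_index=False).apply(
        lambda x: '{}({})'.format(x.name, ', '.join(x[1])))

str_regroup(df.Z1)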

output

A   A(Z1W1, Z1W3)
B   B(Z1W1, Z1W2)

Then apply it to all columns:

df.apply(str_regroup)

output

               Z1       Z2                   Z3                   Z4
0  A(Z1W1, Z1W3)  A(Z2W1)  A(Z3W1, Z3W2, Z3W4)  B(Z4W2, Z4W3, Z4W4)
1  B(Z1W1, Z1W2)  B(Z2W1)
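The pattern above uses (\w), which matches a single character, so it would not capture the multi-word names mentioned in the question's Edit 1. Below is a hedged variant of the same split / regroup / join idea. It is a sketch, not the answer's original code: the name str_regroup_multi, the (.+) pattern, the agg call and the dict comprehension are my substitutions, and the "Nut Butter" / "Almond Milk" data is made up.

```python
import pandas as pd

def str_regroup_multi(s):
    """Join the bracketed values of one column, grouped by their label."""
    # (.+) is greedy, so it captures everything before the last "(",
    # including spaces; blank cells do not match and become NaN.
    parts = s.str.extract(r"(.+)\((.+)\)")
    # groupby drops NaN keys, so blank cells disappear here.
    joined = parts.groupby(0)[1].agg(", ".join)
    return pd.Series([f"{label}({values})" for label, values in joined.items()])

# Hypothetical data with multi-word labels.
df = pd.DataFrame({
    "Z1": ["Nut Butter(Z1W1)", "Nut Butter(Z1W3)", "Almond Milk(Z1W1)", "Almond Milk(Z1W2)"],
    "Z2": ["Nut Butter(Z2W1)", "Almond Milk(Z2W1)", "", ""],
})

# Build the result column by column; shorter columns are padded with NaN.
out = pd.DataFrame({col: str_regroup_multi(df[col]) for col in df.columns})
print(out)
```

Note that groupby sorts the labels, so rows come out in alphabetical label order rather than in order of first appearance.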

Update 2
Performance on 100,000 sample rows:

  • 928 ms for this apply version
  • 1.55 s for the stack() version by @Wen

You could use the following approach:

  • Melt df to get:

     In [194]: melted = pd.melt(df, var_name='col'); melted
     Out[194]:
        col    value
     0   Z1  A(Z1W1)
     1   Z1  A(Z1W3)
     2   Z1  B(Z1W1)
     3   Z1  B(Z1W2)
     4   Z2  A(Z2W1)
     5   Z2  B(Z2W1)
     6   Z2
     7   Z2
     8   Z3  A(Z3W1)
     9   Z3  A(Z3W2)
     10  Z3  A(Z3W4)
     11  Z3
     12  Z4  B(Z4W2)
     13  Z4  B(Z4W3)
     14  Z4  B(Z4W4)
     15  Z4
  • Use regex to extract the row and value columns:

     In [195]: melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True); melted
     Out[195]:
        col value  row
     0   Z1  Z1W1    A
     1   Z1  Z1W3    A
     2   Z1  Z1W1    B
     3   Z1  Z1W2    B
     4   Z2  Z2W1    A
     5   Z2  Z2W1    B
     6   Z2   NaN  NaN
     7   Z2   NaN  NaN
     8   Z3  Z3W1    A
     9   Z3  Z3W2    A
     10  Z3  Z3W4    A
     11  Z3   NaN  NaN
     12  Z4  Z4W2    B
     13  Z4  Z4W3    B
     14  Z4  Z4W4    B
     15  Z4   NaN  NaN
  • Group by col and row and join the values together:

     In [185]: result = melted.groupby(['col', 'row'])['value'].agg(','.join)

     In [186]: result
     Out[186]:
     col  row
     Z1   A           Z1W1,Z1W3
          B           Z1W1,Z1W2
     Z2   A                Z2W1
          B                Z2W1
     Z3   A      Z3W1,Z3W2,Z3W4
     Z4   B      Z4W2,Z4W3,Z4W4
     Name: value, dtype: object
  • Add the row values to the value values (row must first be moved out of the index with result.reset_index('row'), as in the full code below):

     In [188]: result['value'] = result['row'] + '(' + result['value'] + ')'

     In [189]: result
     Out[189]:
         row              value
     col
     Z1    A       A(Z1W1,Z1W3)
     Z1    B       B(Z1W1,Z1W2)
     Z2    A            A(Z2W1)
     Z2    B            B(Z2W1)
     Z3    A  A(Z3W1,Z3W2,Z3W4)
     Z4    B  B(Z4W2,Z4W3,Z4W4)
  • Overwrite the row column values with groupby/cumcount values to set up the upcoming pivot:

     In [191]: result['row'] = result.groupby(level='col').cumcount()

     In [192]: result
     Out[192]:
          row              value
     col
     Z1     0       A(Z1W1,Z1W3)
     Z1     1       B(Z1W1,Z1W2)
     Z2     0            A(Z2W1)
     Z2     1            B(Z2W1)
     Z3     0  A(Z3W1,Z3W2,Z3W4)
     Z4     0  B(Z4W2,Z4W3,Z4W4)
  • Pivoting produces the desired result:

     result = result.pivot(index='row', columns='col', values='value')

import pandas as pd
df = pd.DataFrame({
 'Z1': ['A(Z1W1)', 'A(Z1W3)', 'B(Z1W1)', 'B(Z1W2)'],
 'Z2': ['A(Z2W1)', 'B(Z2W1)', '', ''],
 'Z3': ['A(Z3W1)', 'A(Z3W2)', 'A(Z3W4)', ''],
 'Z4': ['B(Z4W2)', 'B(Z4W3)', 'B(Z4W4)', '']}, index=[0, 1, 2, 3],)

melted = pd.melt(df, var_name='col').dropna()
melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True)
result = melted.groupby(['col', 'row'])['value'].agg(','.join)
result = result.reset_index('row')
result['value'] = result['row'] + '(' + result['value'] + ')'
result['row'] = result.groupby(level='col').cumcount()
result = result.reset_index()
result = result.pivot(index='row', columns='col', values='value')
print(result)

yields

col            Z1       Z2                 Z3                 Z4
row                                                             
0    A(Z1W1,Z1W3)  A(Z2W1)  A(Z3W1,Z3W2,Z3W4)  B(Z4W2,Z4W3,Z4W4)
1    B(Z1W1,Z1W2)  B(Z2W1)                NaN                NaN
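The pivoted frame still carries the axis labels col and row, which the layout asked for in the question does not have. One possible way to drop them is rename_axis. The sketch below repeats the same steps on the sample data so that it runs on its own. Two things in it are my own choices, not part of the answer: the extract result is assigned column by column, and the variable name clean.

```python
import pandas as pd

df = pd.DataFrame({
    'Z1': ['A(Z1W1)', 'A(Z1W3)', 'B(Z1W1)', 'B(Z1W2)'],
    'Z2': ['A(Z2W1)', 'B(Z2W1)', '', ''],
    'Z3': ['A(Z3W1)', 'A(Z3W2)', 'A(Z3W4)', ''],
    'Z4': ['B(Z4W2)', 'B(Z4W3)', 'B(Z4W4)', '']})

# Long format: one row per original cell.
melted = pd.melt(df, var_name='col')

# Split "A(Z1W1)" into label and value; blank cells become NaN.
parts = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True)
melted['row'] = parts[0]
melted['value'] = parts[1]

# Join values per (column, label); NaN labels are dropped by groupby.
result = melted.groupby(['col', 'row'])['value'].agg(','.join)
result = result.reset_index('row')
result['value'] = result['row'] + '(' + result['value'] + ')'
result['row'] = result.groupby(level='col').cumcount()
result = result.reset_index()
result = result.pivot(index='row', columns='col', values='value')

# Remove the "row" and "col" axis names left over from the pivot.
clean = result.rename_axis(index=None, columns=None)
print(clean)
```

Only the axis names are removed here; the data and the 0/1 index stay as they are.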
