Aggregating cells/column in pandas dataframe
I have a dataframe that looks like this:
Index Z1 Z2 Z3 Z4
0 A(Z1W1) A(Z2W1) A(Z3W1) B(Z4W2)
1 A(Z1W3) B(Z2W1) A(Z3W2) B(Z4W3)
2 B(Z1W1) A(Z3W4) B(Z4W4)
3 B(Z1W2)
I want to convert it to:
Index Z1 Z2 Z3 Z4
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1)
Basically, I want to aggregate the values of different cells into one cell, as shown above.
Edit 1
The actual names are two or three words long rather than A or B, for example Nut Butter instead of A.
Things are getting interesting :-)
s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)
v=('('+s.groupby([s.index.get_level_values(1),s[0]])[1].apply(','.join)+')').unstack().apply(lambda x : x.name+x.astype(str)).T
v[~v.apply(lambda x : x.str.contains('None'))].apply(lambda x : sorted(x,key=pd.isnull)).reset_index(drop=True)
Out[1865]:
Z1 Z2 Z3 Z4
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN
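The last line of the snippet above relies on sorted(x, key=pd.isnull) to push missing values to the bottom of each column. A minimal sketch of that trick on its own, using a made-up Series (the name col and its placeholder values are illustrative, not taken from the question):

```python
import numpy as np
import pandas as pd

# Illustrative column with gaps between the real values.
col = pd.Series([np.nan, "A(..)", np.nan, "B(..)"])

# pd.isnull returns False for real values and True for NaN.
# Python's sort is stable and False sorts before True, so real values
# keep their relative order and all NaN entries end up last.
pushed = sorted(col, key=pd.isnull)

print(pushed)
```

Because only the null flag is used as the sort key, the non-null entries are not reordered alphabetically; they just move up.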
Update: change
#s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)
to
s=df.stack().str.split('(',expand=True)
s[1]=s[1].replace({'[(|)]':' '},regex=True).str.strip()
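The change above presumably exists because splitting on spaces breaks multi-word labels such as Nut Butter. A small sketch of the idea, assuming every cell has the form label(payload); the sample cells and the column names label and payload are hypothetical:

```python
import pandas as pd

# Hypothetical cells with a multi-word label; not the question's real data.
cells = pd.Series(["Nut Butter(Z1W1)", "Nut Butter(Z1W3)", "Jam(Z1W2)"])

# Split each cell once, at the first '(', into label and payload.
parts = cells.str.split("(", n=1, expand=True)
parts.columns = ["label", "payload"]

# Remove the closing ')' that is still attached to the payload.
parts["payload"] = parts["payload"].str.rstrip(")")

print(parts)
```

Splitting on the parenthesis keeps the space inside Nut Butter intact, which a split on ' ' would not.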
General idea:
Update 1
# I had to add parameter as_index=False to groupby(0)
# to get exactly same output as asked
Let's try one column:
def str_regroup(s):
    return s.str.extract(r"(\w)\((.+)\)",expand=True).groupby(0,as_index=False).apply(
        lambda x: '{}({})'.format(x.name,', '.join(x[1])))
str_regroup(df1.Z1)
output
A A(Z1W1, Z1W3)
B B(Z1W1, Z1W2)
Then apply to all columns:
df.apply(str_regroup)
output
Z1 Z2 Z3 Z4
0 A(Z1W1, Z1W3) A(Z2W1) A(Z3W1, Z3W2, Z3W4) B(Z4W2, Z4W3, Z4W4)
1 B(Z1W1, Z1W2) B(Z2W1)
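For reference, here is a self-contained sketch of the same per-column idea. It is a variant, not the code above: the helper name regroup_column is made up, the regex is loosened to `(.+?)` so multi-word labels would also match, and the columns are assembled with a dict comprehension instead of df.apply so that the index alignment is explicit.

```python
import pandas as pd

def regroup_column(col: pd.Series) -> pd.Series:
    """Merge cells sharing a label, e.g. A(x), A(y) -> A(x, y)."""
    # Capture the label before '(' and the payload inside the parentheses.
    # Cells that do not match (such as empty strings) become NaN rows.
    parts = col.str.extract(r"^(.+?)\((.+)\)$").dropna()
    # Join the payloads of each label, keeping first-seen label order.
    joined = parts.groupby(0, sort=False)[1].agg(", ".join)
    merged = [f"{label}({payload})" for label, payload in joined.items()]
    return pd.Series(merged, dtype=object)

df = pd.DataFrame({
    "Z1": ["A(Z1W1)", "A(Z1W3)", "B(Z1W1)", "B(Z1W2)"],
    "Z2": ["A(Z2W1)", "B(Z2W1)", "", ""],
    "Z3": ["A(Z3W1)", "A(Z3W2)", "A(Z3W4)", ""],
    "Z4": ["B(Z4W2)", "B(Z4W3)", "B(Z4W4)", ""],
})

# Columns of different lengths are aligned on their 0..n-1 index;
# shorter columns are padded with NaN.
result = pd.DataFrame({name: regroup_column(df[name]) for name in df.columns})
print(result)
```

Unlike the output shown above, missing cells come out as NaN rather than blanks; a final result.fillna('') would blank them if that matters.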
Update 2
Performance on 100 000 sample rows: this apply version takes 928 ms; the stack() version by @Wen takes 1.55 s.
You could use the following approach:
In [194]: melted = pd.melt(df, var_name='col'); melted
Out[194]:
   col    value
0   Z1  A(Z1W1)
1   Z1  A(Z1W3)
2   Z1  B(Z1W1)
3   Z1  B(Z1W2)
4   Z2  A(Z2W1)
5   Z2  B(Z2W1)
6   Z2
7   Z2
8   Z3  A(Z3W1)
9   Z3  A(Z3W2)
10  Z3  A(Z3W4)
11  Z3
12  Z4  B(Z4W2)
13  Z4  B(Z4W3)
14  Z4  B(Z4W4)
15  Z4
Use regex to extract the row and value columns:
In [195]: melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True); melted
Out[195]:
   col value  row
0   Z1  Z1W1    A
1   Z1  Z1W3    A
2   Z1  Z1W1    B
3   Z1  Z1W2    B
4   Z2  Z2W1    A
5   Z2  Z2W1    B
6   Z2   NaN  NaN
7   Z2   NaN  NaN
8   Z3  Z3W1    A
9   Z3  Z3W2    A
10  Z3  Z3W4    A
11  Z3   NaN  NaN
12  Z4  Z4W2    B
13  Z4  Z4W3    B
14  Z4  Z4W4    B
15  Z4   NaN  NaN
Group by col and row and join the values together:
In [185]: result = melted.groupby(['col', 'row'])['value'].agg(','.join)
In [186]: result
Out[186]:
col  row
Z1   A           Z1W1,Z1W3
     B           Z1W1,Z1W2
Z2   A                Z2W1
     B                Z2W1
Z3   A      Z3W1,Z3W2,Z3W4
Z4   B      Z4W2,Z4W3,Z4W4
Name: value, dtype: object
Move row out of the index (result = result.reset_index('row')), then prepend the row labels to the value strings:
In [188]: result['value'] = result['row'] + '(' + result['value'] + ')'
In [189]: result
Out[189]:
    row              value
col
Z1    A       A(Z1W1,Z1W3)
Z1    B       B(Z1W1,Z1W2)
Z2    A            A(Z2W1)
Z2    B            B(Z2W1)
Z3    A  A(Z3W1,Z3W2,Z3W4)
Z4    B  B(Z4W2,Z4W3,Z4W4)
Overwrite the row column values with groupby/cumcount values to set up the upcoming pivot:
In [191]: result['row'] = result.groupby(level='col').cumcount()
In [192]: result
Out[192]:
     row              value
col
Z1     0       A(Z1W1,Z1W3)
Z1     1       B(Z1W1,Z1W2)
Z2     0            A(Z2W1)
Z2     1            B(Z2W1)
Z3     0  A(Z3W1,Z3W2,Z3W4)
Z4     0  B(Z4W2,Z4W3,Z4W4)
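To see why cumcount sets up the pivot, here is a toy sketch in isolation. The frame toy and its placeholder values are illustrative only; it merely mirrors the shape of result at this step.

```python
import pandas as pd

# Toy frame shaped like `result` here: index level 'col', one 'value' column.
toy = pd.DataFrame(
    {"value": ["A(..)", "B(..)", "A(..)", "B(..)", "A(..)", "B(..)"]},
    index=pd.Index(["Z1", "Z1", "Z2", "Z2", "Z3", "Z4"], name="col"),
)

# cumcount numbers the rows inside each 'col' group 0, 1, 2, ...
# in order of appearance, giving each value a target output row.
toy["row"] = toy.groupby(level="col").cumcount()
print(toy["row"].tolist())

# With (row, col) pairs now unique, the frame can be pivoted to wide form.
wide = toy.reset_index().pivot(index="row", columns="col", values="value")
print(wide)
```

The counter restarts for every column, so the first label of each column lands in row 0, the second in row 1, and so on; columns with fewer labels are padded with NaN by the pivot.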
Pivoting produces the desired result:
result = result.pivot(index='row', columns='col', values='value')
import pandas as pd
df = pd.DataFrame({
    'Z1': ['A(Z1W1)', 'A(Z1W3)', 'B(Z1W1)', 'B(Z1W2)'],
    'Z2': ['A(Z2W1)', 'B(Z2W1)', '', ''],
    'Z3': ['A(Z3W1)', 'A(Z3W2)', 'A(Z3W4)', ''],
    'Z4': ['B(Z4W2)', 'B(Z4W3)', 'B(Z4W4)', '']}, index=[0, 1, 2, 3],)
melted = pd.melt(df, var_name='col').dropna()
melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True)
result = melted.groupby(['col', 'row'])['value'].agg(','.join)
result = result.reset_index('row')
result['value'] = result['row'] + '(' + result['value'] + ')'
result['row'] = result.groupby(level='col').cumcount()
result = result.reset_index()
result = result.pivot(index='row', columns='col', values='value')
print(result)
yields
col Z1 Z2 Z3 Z4
row
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN