Create multiple columns with calculations using groupby, aggregate functions in Pandas
Using groupby/aggregate to return multiple columns
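I have a sample dataset that I want to group by one column and then generate 4 new columns based on all the values of an existing column.
Here is some sample data: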
import pandas as pd
from numpy import nan

data = {'AlignmentId': {0: u'ENSMUST00000000001.4-1',
                        1: u'ENSMUST00000000001.4-1',
                        2: u'ENSMUST00000000003.13-0',
                        3: u'ENSMUST00000000003.13-0',
                        4: u'ENSMUST00000000003.13-0'},
        'name': {0: u'NonCodingDeletion',
                 1: u'NonCodingInsertion',
                 2: u'CodingDeletion',
                 3: u'CodingInsertion',
                 4: u'NonCodingDeletion'},
        'value_CDS': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan},
        'value_mRNA': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}}
df = pd.DataFrame.from_dict(data)
It looks like this:
               AlignmentId                name  value_mRNA  value_CDS
0   ENSMUST00000000001.4-1   NonCodingDeletion        21.0        NaN
1   ENSMUST00000000001.4-1  NonCodingInsertion        26.0        NaN
2  ENSMUST00000000003.13-0      CodingDeletion         1.0        1.0
3  ENSMUST00000000003.13-0     CodingInsertion         1.0        1.0
4  ENSMUST00000000003.13-0   NonCodingDeletion         2.0        NaN
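I want to group by AlignmentId and return booleans based on whether certain values exist in the name column, depending on whether value_CDS contains null values. I wrote this function to do that: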
def aggfunc(s):
    if s.value_CDS.any():
        c = set(s.name)
    else:
        c = set(s.name)
    return ('CodingDeletion' in c or 'CodingInsertion' in c,
            'CodingInsertion' in c, 'CodingDeletion' in c,
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)
and called it like this:
merged = df.groupby('AlignmentId').aggregate(aggfunc)
This gives me the error ValueError: Shape of passed values is (318, 4), indices imply (318, 3).
How can I return multiple new columns from a groupby-aggregate?
The output I am looking for is:
ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)
Ideally, I would then put this into a 5-column DataFrame.
If I use .apply instead, the output is incorrect:
ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (False, False, False, False)
However, if I take one group at a time, the result is correct:
In [380]: for aln_id, d in df.groupby('AlignmentId'):
   .....:     print(aggfunc(d))
   .....:
(False, False, False, False)
(True, True, True, False)
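You need to change .name to ['name'], because inside .apply the .name attribute returns the name of the group (the value of the column you grouped by), not the name column: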
def aggfunc(s):
    if s.value_CDS.any():
        c = set(s['name'])
    else:
        c = set(s['name'])
    return ('CodingDeletion' in c or 'CodingInsertion' in c,
            'CodingInsertion' in c, 'CodingDeletion' in c,
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

merged = df.groupby('AlignmentId').apply(aggfunc)
print(merged)
AlignmentId
ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)
dtype: object
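The question also asks for the tuples to end up in a 5-column DataFrame. One possible way to expand a Series of tuples like the one above is sketched here. To keep the snippet self-contained, the Series is rebuilt by hand from the printed output rather than recomputed, and the column names col1 to col4 are placeholders chosen for illustration.

```python
import pandas as pd

# Series of 4-tuples, rebuilt by hand to mirror the printed result above.
merged = pd.Series(
    [(False, False, False, False), (True, True, True, False)],
    index=pd.Index(['ENSMUST00000000001.4-1', 'ENSMUST00000000003.13-0'],
                   name='AlignmentId'),
)

# Placeholder column names (an assumption, not part of the question).
cols = ['col1', 'col2', 'col3', 'col4']

# Each tuple becomes one row; the group keys are kept as the index.
expanded = pd.DataFrame(merged.tolist(), index=merged.index, columns=cols)

# Moving the index into a regular column gives the 5-column frame.
five_cols = expanded.reset_index()
print(five_cols)
```

This is a post-processing step only; it does not change how the tuples are computed.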
You can see the difference by printing both inside the function:
def aggfunc(s):
    print('Name of group is: {}'.format(s.name))
    print('Column name is:\n {}'.format(s['name']))

merged = df.groupby('AlignmentId').apply(aggfunc)
print(merged)
Name of group is: ENSMUST00000000001.4-1
Column name is:
0 NonCodingDeletion
1 NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000001.4-1
Column name is:
0 NonCodingDeletion
1 NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000003.13-0
Column name is:
2 CodingDeletion
3 CodingInsertion
4 NonCodingDeletion
Name: name, dtype: object
Improved code:
def aggfunc(s):
    # the if and else branches built the same set, so the condition is dropped
    c = set(s['name'])
    # return a Series instead of a tuple so each value becomes its own column
    cols = ['col1', 'col2', 'col3', 'col4']
    return pd.Series(('CodingDeletion' in c or 'CodingInsertion' in c,
                      'CodingInsertion' in c, 'CodingDeletion' in c,
                      'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c),
                     index=cols)

merged = df.groupby('AlignmentId').apply(aggfunc)
print(merged)
col1 col2 col3 col4
AlignmentId
ENSMUST00000000001.4-1 False False False False
ENSMUST00000000003.13-0 True True True False
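As an alternative sketch, the same four boolean columns can be built without a custom function by tabulating which names occur in each group with pd.crosstab. This is not the approach from the answer above, just one possible variant. Like the improved function, it ignores value_CDS, since both branches of the original condition were identical. The names wanted, present and result are made up for this example, and col1 to col4 follow the improved code.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({
    'AlignmentId': ['ENSMUST00000000001.4-1'] * 2 + ['ENSMUST00000000003.13-0'] * 3,
    'name': ['NonCodingDeletion', 'NonCodingInsertion',
             'CodingDeletion', 'CodingInsertion', 'NonCodingDeletion'],
    'value_CDS': [nan, nan, 1.0, 1.0, nan],
    'value_mRNA': [21.0, 26.0, 1.0, 1.0, 2.0],
})

# The categories the four output columns are derived from.
wanted = ['CodingDeletion', 'CodingInsertion',
          'CodingMult3Deletion', 'CodingMult3Insertion']

# One row per AlignmentId, one column per name; True where the name occurs.
present = pd.crosstab(df['AlignmentId'], df['name']).gt(0)
# Categories that never occur in the data become all-False columns.
present = present.reindex(columns=wanted, fill_value=False).astype(bool)

result = pd.DataFrame({
    'col1': present['CodingDeletion'] | present['CodingInsertion'],
    'col2': present['CodingInsertion'],
    'col3': present['CodingDeletion'],
    'col4': present['CodingMult3Deletion'] | present['CodingMult3Insertion'],
})
print(result)
```

Because the logic is expressed as column operations, it sidesteps the .name versus ['name'] pitfall entirely. The trade-off is that it only fits cases where the per-group result depends on which values are present.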