
Using groupby/aggregate to return multiple columns

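I have an example dataset that I want to group by one column and then produce 4 new columns based on all the values of an existing column.

Here is some example data: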

import pandas as pd
from numpy import nan

data = {'AlignmentId': {0: u'ENSMUST00000000001.4-1',
  1: u'ENSMUST00000000001.4-1',
  2: u'ENSMUST00000000003.13-0',
  3: u'ENSMUST00000000003.13-0',
  4: u'ENSMUST00000000003.13-0'},
 'name': {0: u'NonCodingDeletion',
  1: u'NonCodingInsertion',
  2: u'CodingDeletion',
  3: u'CodingInsertion',
  4: u'NonCodingDeletion'},
 'value_CDS': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan},
 'value_mRNA': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}}
df = pd.DataFrame.from_dict(data)

Which looks like this:

               AlignmentId                name  value_mRNA  value_CDS
0   ENSMUST00000000001.4-1   NonCodingDeletion        21.0        NaN
1   ENSMUST00000000001.4-1  NonCodingInsertion        26.0        NaN
2  ENSMUST00000000003.13-0      CodingDeletion         1.0        1.0
3  ENSMUST00000000003.13-0     CodingInsertion         1.0        1.0
4  ENSMUST00000000003.13-0   NonCodingDeletion         2.0        NaN

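I want to return boolean values based on whether certain values are present in the name column, depending on whether value_CDS contains null values. I wrote this function to do so: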

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s.name)
    else:
        c = set(s.name)
    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

And then did this:

merged = df.groupby('AlignmentId').aggregate(aggfunc)

This gives me the error ValueError: Shape of passed values is (318, 4), indices imply (318, 3)

How can I return multiple new columns from a groupby-aggregate?

The output I am looking for is:

ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)

Ideally, I would then put this into a 5-column dataframe.

If I use .apply, the output is incorrect:

ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0    (False, False, False, False)

However, if I grab one group at a time, it is correct:

In [380]: for aln_id, d in df.groupby('AlignmentId'):
   .....:     print(aggfunc(d))
   .....:
(False, False, False, False)
(True, True, True, False)

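You need to change .name to ['name'], because inside apply, s.name returns the name of the group (the value of the grouping column for that group), not the name column: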

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s['name'])
    else:
        c = set(s['name'])

    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
AlignmentId
ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0       (True, True, True, False)
dtype: object
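If the result is already a Series of tuples like the one printed above, it can be expanded into separate columns after the fact. The following is only a sketch: the `merged` Series is rebuilt by hand to stand in for the output above, and the column names `col1` to `col4` are placeholders, not names from the question.

```python
import pandas as pd

# Stand-in for the Series of tuples produced by groupby(...).apply(aggfunc).
merged = pd.Series(
    [(False, False, False, False), (True, True, True, False)],
    index=pd.Index(
        ['ENSMUST00000000001.4-1', 'ENSMUST00000000003.13-0'],
        name='AlignmentId',
    ),
)

# Placeholder column names (an assumption, chosen for illustration).
cols = ['col1', 'col2', 'col3', 'col4']

# Each tuple becomes one row; the group keys are kept as the index.
wide = pd.DataFrame(merged.tolist(), index=merged.index, columns=cols)

# Moving the index into a regular column gives the 5-column frame.
five_cols = wide.reset_index()
print(five_cols)
```

Here `reset_index()` turns AlignmentId into an ordinary column, which gives the 5-column dataframe asked for in the question.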

To see the difference between s.name and s['name'] inside apply, print both:

def aggfunc(s):

    print ('Name of group is: {}'.format((s.name)))  
    print ('Column name is:\n {}'.format(s['name']))  


merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

Name of group is: ENSMUST00000000001.4-1
Column name is:
 0     NonCodingDeletion
1    NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000001.4-1
Column name is:
 0     NonCodingDeletion
1    NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000003.13-0
Column name is:
 2       CodingDeletion
3      CodingInsertion
4    NonCodingDeletion
Name: name, dtype: object
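This also explains why the original function returned all False under apply: `set(s.name)` is built from the group key, which is a string, so the set holds single characters rather than the values of the name column. A small illustration in plain Python, using one group key and its names copied from the sample data (the variable names are chosen here for illustration):

```python
# What s.name evaluates to inside apply for the second group.
group_key = 'ENSMUST00000000003.13-0'

# What s['name'] holds for the same group, written as a plain list.
names = ['CodingDeletion', 'CodingInsertion', 'NonCodingDeletion']

c_wrong = set(group_key)  # set of single characters from the key string
c_right = set(names)      # set of the actual values in the name column

print('CodingDeletion' in c_wrong)  # False
print('CodingDeletion' in c_right)  # True
```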

Improved code:

def aggfunc(s):
    # the if and else branches built the same set, so the check is omitted
    c = set(s['name'])

    # return a Series instead of a tuple so each value becomes its own column
    cols = ['col1','col2','col3','col4']
    return pd.Series(('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c), index=cols)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

                          col1   col2   col3   col4
AlignmentId                                        
ENSMUST00000000001.4-1   False  False  False  False
ENSMUST00000000003.13-0   True   True   True  False
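As an alternative that avoids apply altogether, the four flags can be computed row by row and then reduced per group with any(). This is a sketch rather than part of the answer above: it assumes the same col1 to col4 names and uses a trimmed copy of the sample data with only the two columns it needs.

```python
import pandas as pd

# Trimmed copy of the sample data (only the columns used here).
df = pd.DataFrame({
    'AlignmentId': ['ENSMUST00000000001.4-1', 'ENSMUST00000000001.4-1',
                    'ENSMUST00000000003.13-0', 'ENSMUST00000000003.13-0',
                    'ENSMUST00000000003.13-0'],
    'name': ['NonCodingDeletion', 'NonCodingInsertion', 'CodingDeletion',
             'CodingInsertion', 'NonCodingDeletion'],
})

# One boolean per row and flag: does this row's name match the category?
flags = pd.DataFrame({
    'col1': df['name'].isin(['CodingDeletion', 'CodingInsertion']),
    'col2': df['name'].eq('CodingInsertion'),
    'col3': df['name'].eq('CodingDeletion'),
    'col4': df['name'].isin(['CodingMult3Deletion', 'CodingMult3Insertion']),
})

# A group's flag is True if any of its rows matched.
merged = flags.groupby(df['AlignmentId']).any()
print(merged)
```

Because the group function never runs on a sub-frame, the s.name versus s['name'] confusion cannot arise here, and the built-in any() reduction tends to scale better than a Python-level apply on large frames.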
