
Apply custom function to pandas groupby object

import pandas as pd

df1 = pd.DataFrame({'Chromosome': ['1A','1A','1A','1A','1A'],
                    'Marker': ['M1','M2','M3','M4','M5'],
                    'Position': [0,1.2,3.5,6,7.3]})
df2 = pd.DataFrame({'Chromosome': ['1A','1A','1A','1A','1A','1B','1B','1B'],
                    'Marker': ['M1','M2','M3','M4','M5','mk1','mk2','mk3'],
                    'Position': [0,1.2,3.5,6,7.3,0,2.3,3.2]})
#Expected result for df1
#'1A 5 M1 1.2 M2 2.3 M3 2.5 M4 1.3 M5'

#Expected result for df2
#'1A 5 M1 1.2 M2 2.3 M3 2.5 M4 1.3 M5'
#'1B 3 mk1 2.3 mk2 0.9 mk3' 


#My function for computing intermarker distance
def position_interval(df):
    df.loc[:,'diffPos'] = round(df['Position'].diff(),1).shift(-1)

    a = []
    i = 0
    while i < df.shape[0]:
        info = df['Marker'][i]+' '+str(round(df['diffPos'][i],1))
        #print(info)
        a.append(info)
        i +=1
    #print(a)
    a.insert(0,str(len(df['Marker'])))
    a.insert(0,df['Chromosome'][0])
    new_info = ' '.join(a).replace(' nan','')#removing the last ' nan'
    #print(new_info)
    return new_info

Applying the function to df1 works perfectly:

position_interval(df1)

But I'm not sure how to apply it to each groupby group:

position_interval(df2)

Because the function needs the 'Chromosome' column, pass the as_index=False argument to groupby:

df2.groupby('Chromosome', as_index=False).apply(position_interval)

This raises an exception because index label 0 does not exist in the "1B" group: its rows keep their original labels 5, 6 and 7, and df['Marker'][i] looks rows up by label.
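The failure can be reproduced outside of groupby. The following is a minimal sketch, assuming the "1B" rows are selected with a boolean mask (a stand-in for the group that apply receives); the names group_1b, label_lookup_failed and first_marker are illustrative only.

```python
import pandas as pd

df2 = pd.DataFrame({'Chromosome': ['1A','1A','1A','1A','1A','1B','1B','1B'],
                    'Marker': ['M1','M2','M3','M4','M5','mk1','mk2','mk3'],
                    'Position': [0,1.2,3.5,6,7.3,0,2.3,3.2]})

# Stand-in for the "1B" group: the rows keep their original index labels.
group_1b = df2[df2['Chromosome'] == '1B']
index_labels = list(group_1b.index)

# With an integer index, series[0] is a label lookup, and label 0 is absent here.
try:
    group_1b['Marker'][0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# iloc is positional, so position 0 is the first row of the group.
first_marker = group_1b['Marker'].iloc[0]
print(index_labels, label_lookup_failed, first_marker)
```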

Replacing the label-based Series indexing with iloc in the function resolves this problem:

def position_interval(df):
    df.loc[:,'diffPos'] = round(df['Position'].diff(),1).shift(-1)
    a = []
    i = 0
    while i < df.shape[0]:
        # iloc[i] selects the i-th row of the group by position, not by label
        info = df['Marker'].iloc[i]+' '+str(round(df['diffPos'].iloc[i],1))
        #print(info)
        a.append(info)
        i +=1
    #print(a)
    a.insert(0,str(len(df['Marker'])))
    a.insert(0,df['Chromosome'].iloc[0])
    new_info = ' '.join(a).replace(' nan','')#removing the last ' nan'
    #print(new_info)
    return new_info

Output (the string produced for each group):

1A 5 M1 1.2 M2 2.3 M3 2.5 M4 1.3 M5
1B 3 mk1 2.3 mk2 0.9 mk3

Alternative:

It is also possible to iterate over the groupby object:

for i, sub_df in df2.groupby('Chromosome',as_index=False):
    print(position_interval(sub_df))
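
As a variation on the loop above, the per-group strings can be collected instead of printed. This is a minimal sketch, not the answer's own code: interval_summary is a hypothetical helper that builds the same string without adding a 'diffPos' column to the group, and summaries is an illustrative name.

```python
import pandas as pd

def interval_summary(group):
    """Hypothetical helper: summary string for one chromosome group."""
    # Distance from each marker to the next one; the last marker gets NaN.
    gaps = group['Position'].diff().shift(-1).round(1)
    parts = [str(group['Chromosome'].iloc[0]), str(len(group))]
    for marker, gap in zip(group['Marker'], gaps):
        parts.append(marker)
        if pd.notna(gap):  # skip the trailing NaN instead of stripping ' nan'
            parts.append(str(gap))
    return ' '.join(parts)

df2 = pd.DataFrame({'Chromosome': ['1A','1A','1A','1A','1A','1B','1B','1B'],
                    'Marker': ['M1','M2','M3','M4','M5','mk1','mk2','mk3'],
                    'Position': [0,1.2,3.5,6,7.3,0,2.3,3.2]})

# One entry per chromosome, keyed by the group name.
summaries = {chrom: interval_summary(sub_df)
             for chrom, sub_df in df2.groupby('Chromosome')}
for line in summaries.values():
    print(line)
```

Because the helper only reads from the group and works by position, it does not depend on the group's index labels.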
