
How to average values from columns that share the same substring in their column names

I have the following data frame:

import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1]  })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]

Which looks like this:

In [17]: df
Out[17]:
  probe gene  cellA.1  cellA.2  cellB.1  cellB.2
0     a  foo        5       12       15        5
1     b  bar        0       90        3        7
2     c  qux        1       13       11       11
3     d  woz        0        0        2        1

Note that the values are stored in columns that share a substring (e.g. cellA and cellB). In the real data there can be more than two cell IDs, and the numerical index can also go higher (e.g. CellFoo.5).

What I want is the average for each cell ID, so that the result looks like this:

     probe gene  cellA  cellB
     a     foo   8.5    10
     b     bar   45     5
     c     qux   7      11
     d     woz   0      1.5

How can I achieve that with Pandas?

One way is to write a function that maps each column name to the group it belongs in:

>>> df = df.set_index(["probe", "gene"])
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean()
            cellA  cellB
probe gene              
a     foo     8.5   10.0
b     bar    45.0    5.0
c     qux     7.0   11.0
d     woz     0.0    1.5
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean().reset_index()
  probe gene  cellA  cellB
0     a  foo    8.5   10.0
1     b  bar   45.0    5.0
2     c  qux    7.0   11.0
3     d  woz    0.0    1.5

Note that we set the index first (and reset it afterwards) so that we don't have to special-case the columns we don't want to touch. We also have to pass axis=1 because we want to group column-wise, not row-wise.
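Recent pandas releases deprecate axis=1 in DataFrame.groupby, so the call above may warn or fail depending on the installed version. A minimal sketch of the same idea that avoids axis=1 by transposing first is shown below; the variable names values, averaged and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "probe": ["a", "b", "c", "d"],
    "gene": ["foo", "bar", "qux", "woz"],
    "cellA.1": [5, 0, 1, 0],
    "cellA.2": [12, 90, 13, 0],
    "cellB.1": [15, 3, 11, 2],
    "cellB.2": [5, 7, 11, 1],
})

# Move the identifier columns into the index so only measurements remain.
values = df.set_index(["probe", "gene"])

# Transpose so the column names become row labels, group those labels by the
# text before the first ".", average each group, and transpose back.
averaged = values.T.groupby(lambda name: name.split(".")[0]).mean().T

# Turn probe and gene back into ordinary columns.
result = averaged.reset_index()
print(result)
```

The grouping function is the same one used above; only the axis handling changes, so the averages should match the earlier output.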

You can use groupby():

import pandas as pd

df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1]  })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]

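# Columns whose names contain a literal "." hold the measurements.
mask = df.columns.str.contains(".", regex=False)
df1 = df.loc[:, ~mask]  # identifier columns (probe, gene)
df2 = df.loc[:, mask]   # measurement columns (cellA.1, cellA.2, ...)
# Average the measurement columns per prefix, then put the identifiers back in front.
pd.concat([df1, df2.groupby(lambda name: name.split(".")[0], axis=1).mean()], axis=1)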

You could use a list comprehension:

In [1]: df['cellA'] = [(x+y)/2. for x,y in zip(df['cellA.1'], df['cellA.2'])]
In [2]: df['cellB'] = [(x+y)/2. for x,y in zip(df['cellB.1'], df['cellB.2'])]
In [3]: df = df[['probe', 'gene', 'cellA', 'cellB']]
In [4]: df
Out[4]:
  probe gene  cellA  cellB
0     a  foo    8.5   10.0
1     b  bar   45.0    5.0
2     c  qux    7.0   11.0
3     d  woz    0.0    1.5
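The list comprehensions above hard-code two cell IDs with two replicates each. Since the question mentions more cell IDs and higher indices (e.g. CellFoo.5), here is a rough sketch of how the same column-by-column approach could be generalized. It assumes probe and gene are the only identifier columns and that the cell ID is everything before the last "."; id_cols, value_cols, prefixes and members are hypothetical names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "probe": ["a", "b", "c", "d"],
    "gene": ["foo", "bar", "qux", "woz"],
    "cellA.1": [5, 0, 1, 0],
    "cellA.2": [12, 90, 13, 0],
    "cellB.1": [15, 3, 11, 2],
    "cellB.2": [5, 7, 11, 1],
})

# Assumption: these are the only non-measurement columns.
id_cols = ["probe", "gene"]
value_cols = [c for c in df.columns if c not in id_cols]

# Distinct cell IDs (text before the last "."), in order of first appearance.
prefixes = list(dict.fromkeys(c.rsplit(".", 1)[0] for c in value_cols))

result = df[id_cols].copy()
for prefix in prefixes:
    # All replicate columns that belong to this cell ID.
    members = [c for c in value_cols if c.rsplit(".", 1)[0] == prefix]
    # Row-wise mean across the replicates.
    result[prefix] = df[members].mean(axis=1)

print(result)
```

This works for any number of replicates per cell ID, at the cost of an explicit Python loop over the cell IDs rather than a single groupby call.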


 