I have the following data frame:
import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1] })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]
Which looks like this:
In [17]: df
Out[17]:
probe gene cellA.1 cellA.2 cellB.1 cellB.2
0 a foo 5 12 15 5
1 b bar 0 90 3 7
2 c qux 1 13 11 11
3 d woz 0 0 2 1
Note that the values are contained in columns that share the same substring (e.g. cellA and cellB). In the real case there can be more cell IDs than these two, and the numerical suffix can also go higher (e.g. CellFoo.5).
What I want to do is get the average per cell ID, so that the result looks like this:
probe gene cellA cellB
a foo 8.5 10
b bar 45 5
c qux 7 11
d woz 0 1.5
How can I achieve that with Pandas?
One way would be to write a function that takes a column name and returns the group you want to put it in:
>>> df = df.set_index(["probe", "gene"])
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean()
cellA cellB
probe gene
a foo 8.5 10.0
b bar 45.0 5.0
c qux 7.0 11.0
d woz 0.0 1.5
>>> df.groupby(lambda x: x.split(".")[0], axis=1).mean().reset_index()
probe gene cellA cellB
0 a foo 8.5 10.0
1 b bar 45.0 5.0
2 c qux 7.0 11.0
3 d woz 0.0 1.5
Note that we set the index (and reset it afterwards) so that we don't have to special-case the columns we don't want to touch. Also note that we had to specify axis=1 because we want to group column-wise, not row-wise.
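As far as I know, passing axis=1 to DataFrame.groupby is deprecated in recent pandas releases (2.1 and later). A minimal sketch of the same idea that avoids it, assuming the value columns always follow the prefix.number pattern from the question: transpose so the column names become the row index, group by the prefix, then transpose back. The names indexed, averaged and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "probe": ["a", "b", "c", "d"],
    "gene": ["foo", "bar", "qux", "woz"],
    "cellA.1": [5, 0, 1, 0],
    "cellA.2": [12, 90, 13, 0],
    "cellB.1": [15, 3, 11, 2],
    "cellB.2": [5, 7, 11, 1],
})

# Move the identifier columns into the index so only value columns remain.
indexed = df.set_index(["probe", "gene"])

# After transposing, the old column names ("cellA.1", ...) are the row index,
# so an ordinary row-wise groupby on the prefix before "." does the grouping.
averaged = indexed.T.groupby(lambda name: name.split(".")[0]).mean().T

# Bring probe and gene back as regular columns.
result = averaged.reset_index()
print(result)
```

The transposition is cheap for a frame this size; for very wide or mixed-dtype frames it may be worth checking memory use first.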
You can use groupby(), after first splitting off the columns whose names contain a ".":
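import pandas as pd
df = pd.DataFrame({'probe':["a","b","c","d"], 'gene':["foo","bar","qux","woz"], 'cellA.1':[5,0,1,0], 'cellA.2':[12,90,13,0],'cellB.1':[15,3,11,2],'cellB.2':[5,7,11,1] })
df = df[["probe", "gene","cellA.1","cellA.2","cellB.1","cellB.2"]]
# Columns whose names contain a literal "." hold the values to average
mask = df.columns.str.contains(".", regex=False)
df1 = df.loc[:, ~mask]  # identifier columns: probe, gene
df2 = df.loc[:, mask]   # value columns: cellA.1, cellA.2, ...
# Group the value columns by the prefix before the ".", average each group, then reattach the identifiers
pd.concat([df1, df2.groupby(lambda name: name.split(".")[0], axis=1).mean()], axis=1)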
You could use a list comprehension:
In [1]: df['cellA'] = [(x+y)/2. for x,y in zip(df['cellA.1'], df['cellA.2'])]
In [2]: df['cellB'] = [(x+y)/2. for x,y in zip(df['cellB.1'], df['cellB.2'])]
In [3]: df = df[['probe', 'gene', 'cellA', 'cellB']]
In [4]: df
Out[4]:
probe gene cellA cellB
a foo 8.5 10.0
b bar 45.0 5.0
c qux 7.0 11.0
d woz 0.0 1.5
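The explicit version above hard-codes cellA and cellB and assumes exactly two replicates each. Since the question mentions more cell IDs and higher replicate numbers, here is a rough sketch of the same column-by-column approach that discovers the prefixes itself. It assumes every value column contains a "." and that probe and gene are the only identifier columns; value_cols, prefixes and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "probe": ["a", "b", "c", "d"],
    "gene": ["foo", "bar", "qux", "woz"],
    "cellA.1": [5, 0, 1, 0],
    "cellA.2": [12, 90, 13, 0],
    "cellB.1": [15, 3, 11, 2],
    "cellB.2": [5, 7, 11, 1],
})

# Value columns are the ones with a "." in the name (assumption from the question).
value_cols = [c for c in df.columns if "." in c]

# Unique prefixes in order of first appearance, e.g. ["cellA", "cellB"].
prefixes = list(dict.fromkeys(c.split(".")[0] for c in value_cols))

result = df[["probe", "gene"]].copy()
for prefix in prefixes:
    # All replicate columns for this cell ID, however many there are.
    members = [c for c in value_cols if c.split(".")[0] == prefix]
    result[prefix] = df[members].mean(axis=1)

print(result)
```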