
Group pandas columns by word in common in column name?

I have a data set like this:

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1 

I want to get a data frame that:

(1) calculates the mean of all the columns with T01 in their name
(2) gets the mean per S-number, excluding T01 (i.e. the mean of T02 and T03 for each S field)
(3) gets the mean of the values returned from step 2 (i.e. step 2 returns a list of means, one per S-number, and I then want the mean of that list).

So the output for above would be:

   T0_means    mean_of_other_means
B  3.6         3.83
C  1.3         4.33
D  2.3         2.6

(In my head, I just treated the NaNs as 0 when averaging.)

I'm stuck at the first step. I wrote:

import sys
import pandas as pd

df = pd.read_csv('fat_norm_extracted.csv',sep=',')
list_cols_to_keep = ['S01-T01','S02-T01','S03-T01']
df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
print(df)

And the error is:

Traceback (most recent call last):
  File "calculate_averages.py", line 6, in <module>
    df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
  File "/home/slowat/.conda/envs/embedding_nlp/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/slowat/.conda/envs/embedding_nlp/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'column_name'

I know what the error means (column_name is being taken as a literal string), but not how to fix it. Could someone show me a way around this?

You can use str.contains to build a boolean mask msk that flags the column names that include T01. Then select those columns with loc and take the mean across columns to get T0_means. For mean_of_other_means, apply cumsum to msk to create group labels (the running count increases at every T01 column, so each S-number gets its own label), group the remaining columns by those labels, take each group's mean across columns, and finally take the mean of those means:

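# Make seq the index and count NaNs as 0, as in the question
df = df.set_index('seq').fillna(0)
# True for every column whose name contains T01
msk = df.columns.str.contains('T01')
df['T0_means'] = df.loc[:, msk].mean(axis=1)
# msk.cumsum() labels each S-number block; note that groupby(..., axis=1) is deprecated in pandas 2.1 and later
df['mean_of_other_means'] = df.drop(columns='T0_means').loc[:, ~msk].groupby(msk.cumsum()[~msk], axis=1).mean().mean(axis=1)
df = df.reset_index()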

Output:

  seq  S01-T01  S01-T02  S01-T03  S02-T01  S02-T02  S02-T03  S03-T01  S03-T02  S03-T03  T0_means  mean_of_other_means
0   B      7.0      2.0      9.0        2        1      9.0      2.0        1        1  3.666667             3.833333
1   C      0.0      4.0      4.0        2        4      0.0      2.0        6        8  1.333333             4.333333
2   D      5.0      0.0      0.0        2        5      9.0      0.0        1        1  2.333333             2.666667
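
Since groupby(..., axis=1) is deprecated in recent pandas releases (2.1 and later), here is a minimal sketch of the same computation that avoids it. It assumes the S-number is always the part of the column name before the hyphen, transposes the frame, and groups the rows by that prefix. The data is retyped from the question, and the names values, is_t01, others and result are only illustrative:

```python
import numpy as np
import pandas as pd

# Data retyped from the question
df = pd.DataFrame(
    {
        "seq": ["B", "C", "D"],
        "S01-T01": [7, np.nan, 5],
        "S01-T02": [2, 4, np.nan],
        "S01-T03": [9, 4, np.nan],
        "S02-T01": [2, 2, 2],
        "S02-T02": [1, 4, 5],
        "S02-T03": [9, np.nan, 9],
        "S03-T01": [2, 2, np.nan],
        "S03-T02": [1, 6, 1],
        "S03-T03": [1, 8, 1],
    }
)

# Count NaNs as 0, as the question does
values = df.set_index("seq").fillna(0)
is_t01 = values.columns.str.endswith("-T01")

# Step 1: row-wise mean of the T01 columns
t01_means = values.loc[:, is_t01].mean(axis=1)

# Step 2: transpose so column names become the row index, then group
# those rows by their S-number prefix (assumed to precede the hyphen)
others = values.loc[:, ~is_t01]
per_s_means = others.T.groupby(lambda name: name.split("-")[0]).mean().T

# Step 3: mean of the per-S means
mean_of_other_means = per_s_means.mean(axis=1)

result = pd.DataFrame(
    {"T0_means": t01_means, "mean_of_other_means": mean_of_other_means}
)
print(result)
```

Grouping by the name prefix rather than by column position also means the result does not depend on the columns being ordered with T01 first within each S-number.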

The line

df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]

filters the rows of df, keeping those whose value in the column named column_name appears in list_of_cols_to_keep. Since df has no column called column_name, pandas raises the KeyError.

If you want to select the columns, you can do:

df = df.loc[:, list_of_cols_to_keep]

Here : selects all rows.

Otherwise you can also use:

df = df.filter(list_of_cols_to_keep)
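
The two forms behave the same when every requested column exists, but they differ when a label is missing. The small frame below is made up for illustration, and S09-T01 is a hypothetical column name that is deliberately absent from it:

```python
import pandas as pd

# Small made-up frame with a subset of the question's column names
df = pd.DataFrame(
    {"S01-T01": [7.0, 5.0], "S01-T02": [2.0, 4.0], "S02-T01": [2.0, 2.0]}
)
list_of_cols_to_keep = ["S01-T01", "S02-T01"]

# Both select the same columns when every label exists
by_loc = df.loc[:, list_of_cols_to_keep]
by_filter = df.filter(items=list_of_cols_to_keep)

# Add a hypothetical label that is not a column of df
with_missing = list_of_cols_to_keep + ["S09-T01"]

# filter silently keeps only the labels that exist
filtered = df.filter(items=with_missing)

# loc raises KeyError when any requested label is missing
try:
    df.loc[:, with_missing]
    loc_raised = False
except KeyError:
    loc_raised = True

print(list(filtered.columns), loc_raised)
```

So loc is the stricter choice if a missing column should be treated as an error, while filter is more forgiving.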

First, it seems to me that mean_of_other_means is simply the mean of all columns that don't end in T01. Consider row B:

    S01-T02 S02-T02 S03-T02 mean
B      2      1      1      (2+1+1)/3

   S01-T03 S02-T03 S03-T03 mean
B      9      9       1     (9+9+1)/3

The mean of these two means is ((2+1+1)/3 + (9+9+1)/3) / 2 = (2+1+1+9+9+1)/6, which is just the mean of all columns that don't end in T01. This holds because every group has the same number of columns (and NaNs are counted as 0), so it applies equally when grouping by S-number.
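
As a quick check of this arithmetic, the sketch below uses row B's values from the data set and then a made-up counterexample with unequal group sizes, where the shortcut no longer holds. The variable names are only illustrative:

```python
from math import isclose
from statistics import mean

# Row B of the data set, split by T-number
t02 = [2, 1, 1]  # S01-T02, S02-T02, S03-T02
t03 = [9, 9, 1]  # S01-T03, S02-T03, S03-T03

mean_of_means = mean([mean(t02), mean(t03)])
overall_mean = mean(t02 + t03)

# Equal group sizes: both equal 23/6
print(mean_of_means, overall_mean, isclose(mean_of_means, overall_mean))

# Made-up counterexample with unequal group sizes
small = [10]
large = [1, 1, 1]
unequal_mean_of_means = mean([mean(small), mean(large)])  # (10 + 1) / 2
unequal_overall_mean = mean(small + large)                # 13 / 4
print(unequal_mean_of_means, unequal_overall_mean)
```

If NaNs were skipped instead of counted as 0, the groups could effectively have different sizes, and the per-group computation would be needed.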

With that I think you can do:

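# assumes seq is still a regular column (as it is right after read_csv); making it the index keeps it out of the means
df = df.set_index('seq').fillna(0)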
T01_means = df.filter(regex='.*T01$',axis=1).mean(axis=1)
mean_of_means_no_T01 = df.filter(regex='.*(?<!T01)$',axis=1).mean(axis=1)

and then

means_df = pd.concat([T01_means, mean_of_means_no_T01],axis=1)
means_df.columns = ['T01_means', 'mean_of_means_no_T01']

means_df
    T01_means   mean_of_means_no_T01
B   3.666667    3.833333
C   1.333333    4.333333
D   2.333333    2.666667
