Efficiently apply calculation to Pandas DataFrame based on condition?

Question

This is my first time using Stack Overflow. I am quite new to coding and Pandas, so please bear with me. I am practicing manipulating data using Python/Pandas instead of Excel, and I've come across the following problem...

I am trying to standardize values for particular columns by year. My data set is rather small, so the approach I took (shown below) works well; however, I'm fairly certain it is not a great way to accomplish this task. Is there a better way to do this with list comprehensions or by applying a function to the DataFrame? (PS: any other resources you could recommend for learning about those topics, or for examples, would be greatly appreciated!)

Sample Data:

IN: df = pd.DataFrame(data=[[2018,10,100,50], [2018,11,110,30], [2017,12,120,10], [2017, 15, 115, 40]], columns=['Year','c1','c2','c3'])
OUT:
   Year  c1   c2   c3
0  2018  10  100   50
1  2018  11  110   30
2  2017  12  120   10
3  2017  15  115   40

Sample Output:

   Year  c1   c2  c3    c1_std    c2_std
0  2018  10  100  50 -0.707107 -0.707107
1  2018  11  110  30  0.707107  0.707107
2  2017  12  120  10 -0.707107  0.707107
3  2017  15  115  40  0.707107 -0.707107

Notice that the standardized output is only for 2 of the 3 columns.

My approach:

First, I created two tables: one for the means by column and year, and one for the standard deviations by column and year.

standard_devs = pd.DataFrame(data=[], index=[2018, 2017], columns=['c1', 'c2'])
means = pd.DataFrame(data=[], index=[2018, 2017], columns=['c1', 'c2'])
for y in [2018, 2017]:
    for col in ['c1', 'c2']:
        standard_devs.loc[y, col] = df[df['Year'] == y][col].std()
        means.loc[y, col] = df[df['Year'] == y][col].mean()

I iterated through my original data frame and calculated the standardized values based on the appropriate year and column.

for i in list(df.index):
    for col in ['c1', 'c2']:
        year = df.loc[i, 'Year']
        df.loc[i, col + '_std'] = (df.loc[i, col] - means.loc[year, col]) / standard_devs.loc[year, col]

I read before that iterating through a pandas DataFrame is bad practice. I know this method probably cannot scale, so I was wondering how I could be more efficient with my coding.

Thank you all!

Answer 1

You can use groupby.transform here to calculate the std and mean. It computes the requested metric per group and returns a Series with the same index and length as df, so the result lines up row by row with the original column:

for c in ['c1', 'c2']:
    stds = df.groupby('Year')[c].transform('std')
    means = df.groupby('Year')[c].transform('mean')
    df[f'{c}_std'] = (df[c] - means) / stds
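The loop above can also be collapsed into a single pass over both columns, since transform accepts a multi-column selection and returns a DataFrame aligned to df. The following is a minimal sketch of that idea, not part of the original answer; the names cols, grouped, z and result are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[2018, 10, 100, 50], [2018, 11, 110, 30],
          [2017, 12, 120, 10], [2017, 15, 115, 40]],
    columns=['Year', 'c1', 'c2', 'c3'],
)

# Illustrative names (not from the original answer)
cols = ['c1', 'c2']
grouped = df.groupby('Year')[cols]

# transform returns frames indexed like df, so plain arithmetic lines up row by row
z = (df[cols] - grouped.transform('mean')) / grouped.transform('std')

# Attach the standardized columns next to the originals
result = df.join(z.add_suffix('_std'))
print(result)
```

Because every intermediate keeps the original index, no merge or reset_index step is needed, and c3 is left untouched.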

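An alternative approach is to temporarily set the index to your groupby key and look up each row's group statistics by that key:

means = df.groupby('Year')[['c1', 'c2']].mean()
stds = df.groupby('Year')[['c1', 'c2']].std()

# Restrict to the columns being standardized so no all-NaN c3_std column appears
keyed = df.set_index('Year')[['c1', 'c2']]

# .loc[keyed.index] repeats each year's statistics in the original row order
z = (keyed - means.loc[keyed.index]) / stds.loc[keyed.index]

df.join(z.reset_index(drop=True).add_suffix('_std'))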

[out]

   Year  c1   c2  c3    c1_std    c2_std
0  2018  10  100  50 -0.707107 -0.707107
1  2018  11  110  30  0.707107  0.707107
2  2017  12  120  10 -0.707107  0.707107
3  2017  15  115  40  0.707107 -0.707107
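
The ±0.707107 values in this output come from pandas' std defaulting to the sample standard deviation (ddof=1). If the population standard deviation is wanted instead, the scores differ. The sketch below illustrates this on the two 2018 values of c1; the variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# The two c1 values for Year == 2018 from the sample data
values = pd.Series([10, 11])

sample_std = values.std()            # pandas default: ddof=1
population_std = values.std(ddof=0)  # matches NumPy's default for np.std

z_sample = (values - values.mean()) / sample_std
z_population = (values - values.mean()) / population_std

print(z_sample.tolist())
print(z_population.tolist())
print(np.isclose(np.std(values.to_numpy()), population_std))
```

With groupby, one way to get the population version is to pass a callable such as lambda s: s.std(ddof=0) to transform instead of the string 'std'.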
