This is my first time using Stack Overflow. I am quite new to coding and Pandas, so please bear with me. I am practicing manipulating data using Python/Pandas instead of Excel, and I've come across the following problem...
I am trying to standardize values for particular columns by year. My data set is rather small, so the approach I took (shown below) works well; however, I'm fairly certain it is not a great way to accomplish this task. Is there a better way to do this with list comprehensions or by applying a function to the DataFrame? (PS: any other resources you could recommend for learning about those topics, or for examples, would be greatly appreciated!)
Sample Data:
IN: df = pd.DataFrame(data=[[2018,10,100,50], [2018,11,110,30], [2017,12,120,10], [2017, 15, 115, 40]], columns=['Year','c1','c2','c3'])
OUT:
   Year  c1   c2  c3
0  2018  10  100  50
1  2018  11  110  30
2  2017  12  120  10
3  2017  15  115  40
Sample Output:
   Year  c1   c2  c3    c1_std    c2_std
0  2018  10  100  50 -0.707107 -0.707107
1  2018  11  110  30  0.707107  0.707107
2  2017  12  120  10 -0.707107  0.707107
3  2017  15  115  40  0.707107 -0.707107
Notice that the standardized output covers only two of the three columns (c1 and c2).
My approach:
First, I created two tables: one for the means by column and year, and one for the standard deviations by column and year.
standard_devs = pd.DataFrame(data=[], index=[2018, 2017], columns=['c1', 'c2'])
means = pd.DataFrame(data=[], index=[2018, 2017], columns=['c1', 'c2'])
for y in [2018, 2017]:
    for col in ['c1', 'c2']:
        standard_devs.loc[y, col] = df[df['Year'] == y][col].std()
        means.loc[y, col] = df[df['Year'] == y][col].mean()
Then I iterated through my original DataFrame and calculated the standardized values based on the appropriate year and column.
for i in list(df.index):
    for col in ['c1', 'c2']:
        year = df.loc[i, 'Year']
        df.loc[i, col + '_std'] = (df.loc[i, col] - means.loc[year, col]) / standard_devs.loc[year, col]
I read before that iterating through a pandas DataFrame is bad practice. I know this method probably cannot scale, so I was wondering how I could be more efficient with my coding.
Thank you all!
You can use groupby.transform here to calculate std and mean. This will calculate the appropriate metric by group and return a Series with the same length and index as df:
for c in ['c1', 'c2']:
    stds = df.groupby('Year')[c].transform('std')
    means = df.groupby('Year')[c].transform('mean')
    df[f'{c}_std'] = (df[c] - means) / stds
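The loop above can also be collapsed by transforming both columns at once. The following is a minimal sketch of that idea, not the only way to do it; the names `cols`, `grouped`, `row_means`, `row_stds`, `per_year_means` and `result` are only illustrative. It also shows the difference in shape between a plain group reduction and `transform`:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[2018, 10, 100, 50], [2018, 11, 110, 30],
          [2017, 12, 120, 10], [2017, 15, 115, 40]],
    columns=['Year', 'c1', 'c2', 'c3'],
)

cols = ['c1', 'c2']  # columns to standardize (c3 is left alone)
grouped = df.groupby('Year')[cols]

# A plain reduction returns one row per year.
per_year_means = grouped.mean()

# transform returns one row per original row, aligned to df.index,
# so the result can be combined with df directly.
row_means = grouped.transform('mean')
row_stds = grouped.transform('std')

z_scores = (df[cols] - row_means) / row_stds
result = df.join(z_scores.add_suffix('_std'))
print(result)
```

Because the transformed frames share df's index, no explicit loop over rows or columns is needed, and the original row order is kept.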
An alternative approach would be to temporarily set your index to your groupby key:
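means = df.groupby('Year')[['c1', 'c2']].mean()
stds = df.groupby('Year')[['c1', 'c2']].std()
# Keep only c1/c2 so no all-NaN c3_std column is produced, and reindex the
# per-year statistics to df's row order so reset_index lines up with df.
by_year = df.set_index('Year')[['c1', 'c2']]
(df.join(((by_year - means.reindex(by_year.index)) / stds.reindex(by_year.index))
         .reset_index(drop=True)
         .add_suffix('_std')))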
[out]
   Year  c1   c2  c3    c1_std    c2_std
0  2018  10  100  50 -0.707107 -0.707107
1  2018  11  110  30  0.707107  0.707107
2  2017  12  120  10 -0.707107  0.707107
3  2017  15  115  40  0.707107 -0.707107
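Since the question also asks about applying a function, here is a rough sketch of how the same calculation could be wrapped in a reusable helper. `standardize_by_group` and its `ddof` parameter are hypothetical names made up for this example, not part of pandas. The default `ddof=1` matches the sample standard deviation that pandas' `std` uses, which is why every value in a two-row group comes out as ±0.707107 (that is, ±1/√2):

```python
import pandas as pd


def standardize_by_group(frame, key, cols, ddof=1):
    """Return a copy of frame with <col>_std z-score columns computed within each key group.

    Hypothetical helper for illustration; ddof=1 mirrors pandas' default sample std.
    """
    z = frame.groupby(key)[cols].transform(
        lambda g: (g - g.mean()) / g.std(ddof=ddof)
    )
    # z is aligned to frame.index, so a plain index join is enough.
    return frame.join(z.add_suffix('_std'))


df = pd.DataFrame(
    data=[[2018, 10, 100, 50], [2018, 11, 110, 30],
          [2017, 12, 120, 10], [2017, 15, 115, 40]],
    columns=['Year', 'c1', 'c2', 'c3'],
)

sample = standardize_by_group(df, 'Year', ['c1', 'c2'])              # sample std
population = standardize_by_group(df, 'Year', ['c1', 'c2'], ddof=0)  # population std
print(sample)
```

With `ddof=0` the two-row groups standardize to ±1 instead, so it is worth deciding which definition you want before comparing against other tools.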