
Equivalent Python and pandas operation for group_by + mutate + indexing column vectors within mutate in R

Sample data frame in Python:

import pandas as pd

d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"], 
     'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)

Now I want to get the same output in Python with pandas as I get in R with the code below. That is, I want the percentage change in col2 within each group of col1.

data.frame(col1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
           col2 = c(3, 4, 5, 1, 3, 9, 16, 18, 23)) -> df

df %>%
  dplyr::group_by(col1) %>%
  dplyr::mutate(perc = (dplyr::last(col2) - col2[1]) / col2[1])

In Python, I tried:

def perc_change(column):
    index_1 = tu_in[column].iloc[0]
    index_2 = tu_in[column].iloc[-1]
    perc_change = (index_2 - index_1) / index_1  
    return(perc_change)    

d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"], 
     'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
df.assign(perc_change = lambda x: x.groupby["col1"]["col2"].transform(perc_change))

But it gives me an error saying: 'method' object is not subscriptable.

I am new to Python and trying to convert some R code into Python. How can I solve this in an elegant way? Thank you!

You don't want transform here. transform is typically used when your aggregation returns a scalar value per group and you want to broadcast that result to all rows that belong to that group in the original DataFrame. Because GroupBy.pct_change already returns a result indexed like the original, you can assign it back directly.

df['perc_change'] = df.groupby('col1')['col2'].pct_change()

#  col1  col2  perc_change
#0    a     3          NaN
#1    a     4     0.333333
#2    a     5     0.250000
#3    b     1          NaN
#4    b     3     2.000000
#5    b     9     2.000000
#6    c     5          NaN
#7    c     7     0.400000
#8    c    23     2.285714
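To make the broadcasting point concrete, here is a minimal sketch on the same sample data. It compares a per-group aggregation, the same function run through transform, and pct_change; the variable names (`per_group`, `per_row`, `row_change`) are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "col2": [3, 4, 5, 1, 3, 9, 5, 7, 23],
})

grouped = df.groupby("col1")["col2"]

# agg with a scalar-returning function: one value per group,
# indexed by the group key (a, b, c).
per_group = grouped.agg("first")

# transform with the same function: the per-group scalar is repeated
# for every row of its group, so the result lines up with df's index.
per_row = grouped.transform("first")

# pct_change is computed row by row within each group and is already
# aligned with df's index, so no transform is needed.
row_change = grouped.pct_change()
```

Only `per_group` is indexed by group key; the other two can be assigned to a new column of `df` as they are.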

But if what you need instead is the overall percentage change within each group, that is, the difference between the last and first values divided by the first value (which is what the R code in the question computes), then you do want transform.

df.groupby('col1')['col2'].transform(lambda x: (x.iloc[-1] - x.iloc[0])/x.iloc[0])

0    0.666667
1    0.666667
2    0.666667
3    8.000000
4    8.000000
5    8.000000
6    3.600000
7    3.600000
8    3.600000
Name: col2, dtype: float64
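As for the error in the question: `x.groupby["col1"]` subscripts the `groupby` method instead of calling it, which is what raises "'method' object is not subscriptable". Below is a minimal sketch of the original attempt with that fixed. It assumes the helper should work on the group's Series that transform passes in, rather than looking the column up in an outer data frame such as `tu_in`.

```python
import pandas as pd

def perc_change(values: pd.Series) -> float:
    """Overall change within one group: (last - first) / first."""
    first = values.iloc[0]
    last = values.iloc[-1]
    return (last - first) / first

df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "col2": [3, 4, 5, 1, 3, 9, 5, 7, 23],
})

# groupby is a method, so it is called with parentheses: groupby("col1").
# transform passes each group's col2 values to perc_change and broadcasts
# the returned scalar to every row of that group.
result = df.assign(
    perc_change=lambda x: x.groupby("col1")["col2"].transform(perc_change)
)
```

Like dplyr's mutate, `assign` returns a new data frame and leaves `df` unchanged, so keep the return value if you need the new column.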


 