I've got a dataframe that looks something like this:
A B
0 1
1 2
2 3
3 4
I now want to create a column C that performs some operation on the values in A and B, where each value in C also serves as the basis for the next one.
So, for example, a row in C = (prev_value_in_C)/(A+B)
Say I initialize the first row to the value 5. Then it would look something like:
A B C
0 1 5
1 2 5/3 = 1.67
2 3 1.67/5 = .334
3 4 .334/7 = .047
I'm trying to understand if rolling or expanding can be used -- or if such an operation WITHOUT using for loops is possible directly through the present pd tools.
Something sort of like:
df['C'] = df['C'].shift()/(df['A'] + df['B'])
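A small sketch of why a single vectorized `shift()` cannot fill the whole column: the expression is evaluated once, against C as it exists at that moment, so only the row directly below the seed receives a value. Seeding C with 5 followed by NaN is an assumption made for this illustration, and `one_pass` is a hypothetical name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3], "B": [1, 2, 3, 4]})

# Assumption for illustration: only the first value of C is known up front.
df["C"] = [5.0, np.nan, np.nan, np.nan]

# shift() moves the existing C down by one row; it does not feed newly
# computed values back in, so rows 2 and 3 still see NaN as their "previous C".
one_pass = df["C"].shift() / (df["A"] + df["B"])
print(one_pass)
```

Only row 1 ends up with a number (5/3); every other row is NaN, which is why the recurrence needs some form of sequential evaluation.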
I don't think pandas has such a command. I think a for loop is the best idea: loop once per row, retrieve the C value of the previous row, do the calculation, and write the result into the C value of the current row.
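The "take the previous C, compute, store" idea can also be written without an explicit `for` statement using `itertools.accumulate` from the standard library. This is only a sketch: it still runs sequentially in Python, so it is not expected to be faster than a loop, and the `initial=` argument assumes Python 3.8 or newer. The name `denominators` is hypothetical.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3], "B": [1, 2, 3, 4]})

# A + B for every row; the first entry is skipped below because
# row 0 holds the seed value rather than a computed one.
denominators = (df["A"] + df["B"]).tolist()

# accumulate() passes each result back in as `prev`, i.e.
# C[i] = C[i-1] / (A[i] + B[i]), starting from the seed 5.0.
df["C"] = list(
    accumulate(denominators[1:], lambda prev, d: prev / d, initial=5.0)
)
print(df)
```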
I think loops are necessary here, because recursive calculations are generally not vectorisable. To improve performance, numba can be used:
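import numpy as np
import pandas as pd
from numba import jit

df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 2, 3, 4]})

@jit(nopython=True)
def f(a, b, first):
    # c[0] is the seed; every later value divides the previous one by a[i] + b[i]
    c = np.empty(a.shape)
    c[0] = first
    for i in range(1, a.shape[0]):
        c[i] = c[i-1] / (a[i] + b[i])
    return c

df['C'] = f(df['A'].to_numpy(), df['B'].to_numpy(), 5)
print(df)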
   A  B         C
0  0  1  5.000000
1  1  2  1.666667
2  2  3  0.333333
3  3  4  0.047619
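For this particular recurrence there happens to be a closed form: since each step only divides by A + B, C[i] equals the seed divided by the running product of the denominators from row 1 to row i. A minimal sketch of that idea with `np.cumprod` follows. It does not generalise to arbitrary recursive formulas, it assumes no A + B is zero, and on long frames the running product may overflow to `inf` where the loop would instead shrink gradually towards zero. The names `first` and `denom` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3], "B": [1, 2, 3, 4]})

first = 5.0  # seed value for row 0

# Denominators as a writable float array (copy=True guards against a read-only view).
denom = (df["A"] + df["B"]).to_numpy(dtype=float, copy=True)
denom[0] = 1.0  # row 0 is the seed itself, so it must not be divided

# C[i] = first / (denom[1] * denom[2] * ... * denom[i])
df["C"] = first / np.cumprod(denom)
print(df)
```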
Performance on a small DataFrame (4k rows):
df = pd.concat([df] * 1000, ignore_index=True)
In [45]: %%timeit
    ...: df['C1'] = f(df['A'].to_numpy(), df['B'].to_numpy(), 5)
    ...:
213 µs ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [46]: %%timeit
    ...: df['C2'] = 5
    ...: for i in range(1, len(df)):
    ...:     df.loc[i, 'C2'] = df.loc[i-1, 'C2'] / (df.loc[i, 'A'] + df.loc[i, 'B'])
    ...:
2.28 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can try this:
# float seed so that column C keeps a float dtype when fractions are written into it
df['C'] = 5.0
for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] / (df.loc[i, 'A'] + df.loc[i, 'B'])
Output:
   A  B         C
0  0  1  5.000000
1  1  2  1.666667
2  2  3  0.333333
3  3  4  0.047619