
Python Pandas -- Create dataframe column based off of its own previous value in earlier rows

I've got a dataframe that looks something like this:

A B
0 1
1 2 
2 3
3 4

I now want to create a column C whose values are computed from A and B, and where each value of C also serves as the input for the next row.

So, for example a row in C = (prev_value_in_C)/(B+A)

So let's say I initialize the first row to have the value 5. Then it would look something like:

A B C
0 1 5
1 2 5/3    = 1.67
2 3 1.67/5 = .334
3 4 .334/7 =.047

I'm trying to understand if rolling or expanding can be used -- or if such an operation WITHOUT using for loops is possible directly through the present pd tools.

Something sort of like:

df['C'] = df['C'].shift()/(df['A'] + df['B'])

I don't think pandas has a built-in command for this. I think a for loop is the best option: iterate over the rows, retrieve the C value of the previous row, do the calculation, and write the result into the C value of the current row.
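To see why the one-line shift attempt from the question does not produce the recurrence, here is a small sketch. It assumes C has been pre-filled with the seed value 5 in every row (an assumption for illustration, since the question does not say how C is initialized). shift() only sees the values already stored in C, so every row divides the original 5 rather than the freshly computed previous result.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3], "B": [1, 2, 3, 4]})
df["C"] = 5.0  # assumed initialization: the seed repeated in every row

# shift() reads the stored column, not the values being computed,
# so rows 2 and 3 divide 5 instead of the previous result.
attempt = df["C"].shift() / (df["A"] + df["B"])
print(attempt)
```

The first entry is NaN because there is no previous row, and the later entries are 5/3, 5/5 and 5/7 instead of the chained values the question asks for.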

I think loops are necessary here, because recursive calculations like this are generally not vectorisable. To improve performance, you can use numba:

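import numpy as np
from numba import jit

@jit(nopython=True)
def f(a, b, first):
    # c[0] is the seed; every later entry depends on the one before it
    c = np.empty(a.shape)
    c[0] = first
    for i in range(1, a.shape[0]):
        c[i] = c[i-1] / (a[i] + b[i])
    return c

# pass plain NumPy arrays so numba can compile the loop
df['C'] = f(df['A'].to_numpy(), df['B'].to_numpy(), 5)
print(df)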
   A  B         C
0  0  1  5.000000
1  1  2  1.666667
2  2  3  0.333333
3  3  4  0.047619
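As a side note, this particular recurrence only ever divides by a quantity that is known up front, so it can be unrolled into C[i] = first / ((A[1]+B[1]) * ... * (A[i]+B[i])). The sketch below is not the numba approach from this answer but an alternative that relies on that unrolling and np.cumprod. It assumes no A+B value after row 0 is zero, and that the running product stays within floating-point range, which may not hold for long frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3], "B": [1, 2, 3, 4]})

first = 5.0
denom = (df["A"] + df["B"]).to_numpy(dtype=float)
denom[0] = 1.0  # row 0 holds the seed, so it contributes no division

# The running product of the divisors turns the recurrence into one division.
df["C"] = first / np.cumprod(denom)
print(df)
```

This trick does not carry over to recurrences that mix the previous value with addition or other non-multiplicative steps, where an explicit loop is still the straightforward choice.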

Performance on a small DataFrame with 4k rows:

df = pd.concat([df] * 1000, ignore_index=True)

In [45]: %%timeit
    ...: df['C1'] = f(df['A'].to_numpy(), df['B'].to_numpy(), 5)
    ...: 
    ...: 
213 µs ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [46]: %%timeit
    ...: 
    ...: df['C2'] = 5
    ...: for i in range(1, len(df)):
    ...:     df.loc[i, 'C2'] = df.loc[i-1, 'C2'] / (df.loc[i, 'A'] + df.loc[i, 'B'])
    ...:     
2.28 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
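If numba is not installed, the same function can be run without the decorator. The sketch below uses a hypothetical name, f_plain, and checks on the 4-row example that it matches the row-by-row .loc loop. It makes no claim about timings, only that both approaches compute the same column.

```python
import numpy as np
import pandas as pd

def f_plain(a, b, first):
    # Same recurrence as the numba version, minus the @jit decorator.
    c = np.empty(a.shape[0], dtype=np.float64)
    c[0] = first
    for i in range(1, a.shape[0]):
        c[i] = c[i - 1] / (a[i] + b[i])
    return c

df = pd.DataFrame({"A": [0, 1, 2, 3], "B": [1, 2, 3, 4]})
df["C1"] = f_plain(df["A"].to_numpy(), df["B"].to_numpy(), 5.0)

# Row-by-row .loc version, kept only as a reference result.
df["C2"] = 5.0
for i in range(1, len(df)):
    df.loc[i, "C2"] = df.loc[i - 1, "C2"] / (df.loc[i, "A"] + df.loc[i, "B"])

same = np.allclose(df["C1"], df["C2"])
print(same)
```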

You can try this:

df['C'] = 5.0  # float seed, so the float results fit the column dtype
for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] / (df.loc[i, 'A'] + df.loc[i, 'B'])

Output:

   A  B         C
0  0  1  5.000000
1  1  2  1.666667
2  2  3  0.333333
3  3  4  0.047619
