简体   繁体   中英

How to iterate with result of previous rows of same column?

Starting from a Data Frame with the columns ABDP:

import numba
import numpy as np
import pandas as pd
import vaex

d = {'A':[0,1,2,3,4,5,6],'B':[30,35,32,35,31,39,37],'D':[12,10,13,19,12,21,13],'P':[3,3,3,3,3,3,3]}
df = pd.DataFrame(data=d)

df['C'] = 0

df

Exporting to hdf5:

df.to_csv("v_df", index=False)
    
df = vaex.from_csv("v_df", convert=True, chunk_size=5_000_000)

I need to calculate the column 'C' with each row with a desired result as follow:

'C': [0, 1.666666667, 0.552380952, 2.333630952, 0.962202381, 6.38155722, 5.714890553]

To reach this result I need to iterate each row starting on row (1) and calculate something like:

%%time
@numba.njit
def func(B, D, C, b_0=0):
    n = len(B)
    b = np.full(n, b_0, dtype=np.float64)
    for i in range(1, n):
        b[i] = ((((B[i] - B[i - 1]) / B[i - 1])) * D[i]) + C[i-1]
    return b
df['C'] = func(df['B'].to_numpy(),df['D'].to_numpy(),df['C'].to_numpy())
df

But it doesn't work.

It results ok for the first and second row:

'C': [0, 1.666666667, -1.114286, 1.781250, -1.371429, 5.419355, -0.666667]

I've tried the 'shift' (suggestion by sammywemmy). It works ok for this example data frame before 'vaex.from_csv', but when implemented the concept to the big data frame via vaex, shift is not working.

So, the question is: Is there a 'good practice' way to perform this calculus (iterate the result of a row in column 'C' considering the previous rows in 'C') with a large data source (I´m using vaex to import from a 1 million rows csv)?

Thanks in advance.

I am not sure it is the optimal solution but at least it is working: you can use the apply method with vectorize set to True .

Here is the full snippet:

import numba
import numpy as np
import pandas as pd
import vaex

d = {'A':[0,1,2,3,4,5,6],'B':[30,35,32,35,31,39,37],'D':[12,10,13,19,12,21,13],'P':[3,3,3,3,3,3,3]}
df = pd.DataFrame(data=d)

df['C'] = 0

df

# I removed the b_0 for simplicity
def my_func(B, D, C):
    n = len(B)
    b = np.full(n, 0, dtype=np.float64)
    for i in range(1, n):
        b[i] = ((((B[i] - B[i - 1]) / B[i - 1])) * D[i]) + C[i-1]

    return b

df_vaex = vaex.from_pandas(df)

df_vaex.apply(my_func, arguments=[df_vaex["B"], df_vaex["D"], df_vaex["C"]], vectorize=True)

It gives the expected output:

0          0
1    1.66667
2   -1.11429
3    1.78125
4   -1.37143
5    5.41935
6  -0.666667

Basically, apply method allows to apply functions row by row and if you set vectorize to True the full arrays will be passed instead of the row values. In your case it is mandatory since you need the values from the previous rows to compute the C value in a given row.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM