How to improve the calculation's speed in python?

Question

I'm building a calculation to add a new column to my dataframe. Here is my data:

I need to create a new column "mob". The calculation of "mob" is that

if the "LoanID" of a certain row is the same as the one of the previous row. For example, if loan['LoanId'][0] = loan['LoanId'] 1 ;
if the "mob" the previous row is >0; if so, then the "mob" value of this row will add 1 from the value of previous row; if not, try if the loan['repay_lbl'] of the row is 1 or 2, if so, the "mob" value of the row will be 1;

My code is as below:

for i in range(1,len(loan['LoanId'])):
if loan['LoanId'][i-1] == loan['LoanId'][i]:
    if loan['mob'][i-1] > 0:
        loan['mob'][i] = loan['mob'][i-1] +1 
    elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
        loan['mob'][i] = 1

The code will cost O(n). Is there any way to improve the algorithm and speed up? I'm just a beginner of Python. I would appreciate so much for your help.

Answer 1

Since the value of the mob column for each row depends on that of the previous row, it depends on all previous rows . That means that you can't run this in parallel and you're basically stuck with O(n) .

So I don't think that numpy array operations are going to be of much use here.

Failing that, there is the usual bag of tricks to speed up Python code;

PyPy
Cython
Numba

I'm not sure if the first two work well with numpy/pandas. You might have to use normal Python lists for your data in those cases.

Of course before you dive into any of these, you should consider whether your data set is large enough to warrant the effort.

Answer 2

Improving Time by Changing Looping Method

Improving loop time based upon

Looping through all N rows without broadcasting, so complexity is O(N)
While all are order N, different looping methods have different complexity scaling factors
The different scaling factors make some methods much faster than others

Inspired by - Different ways to iterate over rows in a Pandas Dataframe — performance comparison

Methods

For loop -- original post
iterrows
itertuples
zip

Summary

The zip method was 93x faster than for loop (ie OP method) for 100K rows

Test Code

import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    LoanId = [randint(0, N //4) for _ in range(N)]  # though random, N//4 ensures
                                                    # high likelihood some rows repeat
                                                    # LoanID
    repay_lbl = [randint(0, 2) for _ in range(N)]

    data = {'LoanId':LoanId, 'repay_lbl': repay_lbl, 'mob':[0]*N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    loan = loan.copy()  # copy since timing calls function multiple time
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in enumerate(loan.itertuples()): # iterate over rows with iterrows()
        if prev_loanID is not None:
             if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1 
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1
            
        # Query for latest values   
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
                    
    return loan
    
def m_for_loop(loan):
    ' For loop over the data frame '
    loan = loan.copy()  # copy since timing calls function multiple time
                        # so don't want to modify input
                        # not necessary in general
            
    for i in range(1,len(loan['LoanId'])):
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] +1 
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan

def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy since timing calls function multiple time
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows(): # iterate over rows with iterrows()
        if prev_loanID is not None:
             if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1 
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1
                    
        # Query for latest values          
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
        
    return loan

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy since timing calls function multiple time
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob  = None, None
    for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_loanID is not None:
             if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1
        
        # Update to latest values
        prev_loanID, prev_mob = loanID, mob
        
    return loan

Note: Iterator code queried dataframe for updated data rather than getting from iterator do to warning :

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

Also compared DataFrames using assert df1.equals(df2) to verify the different methods produced identical results

Timing Code

Using benchit

inputs = [create_input(i) for i in 10**np.arange(6)]  # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]

t = benchit.timings(funcs, inputs)

Results

Run time in seconds

Functions  m_for_loop  m_iterrows  m_itertuples     m_zip
Len                                                      
1            0.000217    0.000493      0.000781  0.000327
10           0.001070    0.002002      0.001008  0.000353
100          0.007100    0.016501      0.003062  0.000498
1000         0.056940    0.162423      0.021396  0.001057
10000        0.565809    1.625043      0.210858  0.006938
100000       5.890920   16.658842      2.179602  0.062953

How to improve the calculation's speed in python?

Question

2 answers

solution1
1 2020-08-20 17:20:35

solution2
1 ACCPTED 2020-08-21 10:25:47

How to improve the calculation's speed in python?

Question

2 answers

solution1 1 2020-08-20 17:20:35

solution2 1 ACCPTED 2020-08-21 10:25:47

solution1
1 2020-08-20 17:20:35

solution2
1 ACCPTED 2020-08-21 10:25:47