df.drop in a for loop taking too long

Question

I am having issues with my code taking too long to run, and am unsure about how to make it more efficient/ run faster.

I have a data frame of about 3.5 million observations and have this code as my criteria within the for loop. If I drop the trades = trades.drop([i], axis=0) section then it takes seconds and doesn't drop the rows, however, with this line of code it hasn't even stopped running yet.

Ideally I would allocate all of the rows where the for loop is successful into another data frame as I know what to do from there, but I am unsure about how to do that which is why the.drop is there.

Any help would be much appreciated.

# Drop the rows/ trades that are greater than the ASK_PLUS_SPREAD
counter = 0
for i in range(0, len(trades['ASK_PLUS_SPREAD'])):
    if trades['PRICE'][i] > trades['ASK_PLUS_SPREAD'][i]:
        counter = counter + 1
        counts_q['F5'] = counter
        trades = trades.drop([i], axis=0)
    else:
        pass
    
# Drop the rows/ trades that are lower than the BID_MINUS_SPREAD and add the amount taken to the counter
for i in range(0, len(trades['BID_MINUS_SPREAD'])):
    if trades['PRICE'][i] < trades['BID_MINUS_SPREAD'][i]:
        counter = counter + 1
        counts_q['F5'] = counter 
        trades = trades.drop([i], axis = 0)
    else:
        pass

Answer 1

This will give you the trades df with only rows where PRICE is greater than ASK_PLUS_SPREAD .

trades = trades[trades['PRICE'] > trades['ASK_PLUS_SPREAD]]

If you want some sort of comparison of how many rows were removed, I'd look into comparing trades.shape before and after the change.

Answer 2

What you did in the line

 trades = trades.drop([i], axis=0)

is to create a new DataFrame by dropping one row from the old DataFrame. That' s why it is slow. You can parallelize your operation by using Boolean indexing. Consider

trades['PRICE'] > trades['ASK_PLUS_SPREAD']

this takes the column PRICE and column ASK_PLUS_SPREAD as two series and compare, which the result is a new series, in Boolean.

Now you can do two things with it: Firstly,

count = (trades['PRICE'] > trades['ASK_PLUS_SPREAD']).sum()

which counts how many PRICE greater than ASK_PLUS_SPREAD (since True is regarded as 1 and False is 0 in this case). Secondly,

trades = trades[trades['PRICE'] > trades['ASK_PLUS_SPREAD']]

which selects only the rows that fits the condition. In fact, if you don't need to count which is less and which is greater, you can replace the entire code in your example with the following:

condition = (trades['PRICE'] > trades['ASK_PLUS_SPREAD']) | (trades['PRICE'] < trades['BID_MINUS_SPREAD'])
counter = condition.sum()
trades = trades[condition]

which | is a bitwise OR. You need to use bitwise operation to combine boolean series.

df.drop in a for loop taking too long

Question

2 answers

solution1
0 2022-04-11 17:47:30

solution2
0 2022-04-11 17:59:47

df.drop in a for loop taking too long

Question

2 answers

solution1 0 2022-04-11 17:47:30

solution2 0 2022-04-11 17:59:47

solution1
0 2022-04-11 17:47:30

solution2
0 2022-04-11 17:59:47