Odd dropping of pandas rows based on conditions

Question

I use the following function to drop a random sample of n rows where df.lab == 0:

def df_proc(df, n):
    # Counts before dropping; label 1 is the control and should not change
    print(list(df.lab).count(0))
    print('C:', list(df.lab).count(1))

    # Drop n randomly sampled rows where lab == 0 (drop matches on index labels)
    df = df.drop(df[df.lab.eq(0)].sample(n).index)

    # Counts after dropping
    print(list(df.lab).count(0))
    print('C:', list(df.lab).count(1))

    return df

This works fine on a small df (e.g. n = 100). However, when I increase the number of rows in the df, something odd happens: the counts of the other labels (!= 0) also begin to decrease, as if they were affected by the condition.

For example:

# dummy example:
import random
import pandas as pd

list2 = [random.randrange(0, 6, 1) for i in range(1500000)]
list1 = [random.randrange(0, 100, 1) for i in range(1500000)]
dft = pd.DataFrame(list(zip(list1, list2)), columns=['A', 'lab'])
dftest = df_proc(dft, 100000)

gives...

But when I run this on my actual df:

dftest = df_proc(S1,100000)

I get a change in my control labels, which is unexpected.

I'm not sure where the error could have come from. I have tried using frac and df.query('lab == 0') but still run into the same error. The other thing I noticed is that with small n the control labels are unchanged; it's only when I increase n that they change. Even so, the totals can be off for small n:

dftest = df_proc(S1,1)

gives:

This doesn't add up, as 3 samples have been removed, not 1.

Answer 1

If it's only about filtering, why not use:

dft = dft[dft['lab'] != 0]

This will filter out all rows with lab == 0.
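Note that the mask above removes every lab == 0 row, while the question asks to remove only n of them. A minimal sketch of a mask-based variant that drops n random zero rows by position, so it does not depend on index labels at all, might look like this. The function name drop_n_zero_rows and the seed parameter are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd

def drop_n_zero_rows(df, n, seed=None):
    """Return df without n randomly chosen rows where lab == 0 (hypothetical helper)."""
    # Integer positions (not index labels) of the rows with lab == 0
    zero_pos = np.flatnonzero(df["lab"].to_numpy() == 0)
    rng = np.random.default_rng(seed)
    # Pick n distinct positions to remove
    drop_pos = rng.choice(zero_pos, size=n, replace=False)
    # Boolean keep-mask over positions; duplicate index labels are irrelevant here
    keep = np.ones(len(df), dtype=bool)
    keep[drop_pos] = False
    return df[keep]
```

Because the mask is positional, this should remove exactly n rows even when the index contains repeated labels.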

Answer 2

The problem was that drop removes rows by index label. My df was a concatenation of several DataFrames, so index labels were repeated and dropping one label removed every row that shared it. I had to use reset_index to overcome the problem.
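The effect can be reproduced on a tiny example. The following is a sketch with made-up frames (part_a and part_b are hypothetical stand-ins for the concatenated pieces of S1), showing that dropping one sampled label removes two rows until the index is reset:

```python
import pandas as pd

# Two small frames, each with its own default index 0, 1, 2
part_a = pd.DataFrame({"A": [10, 11, 12], "lab": [0, 1, 2]})
part_b = pd.DataFrame({"A": [20, 21, 22], "lab": [1, 0, 1]})

# Plain concat keeps the original labels, so the index is 0, 1, 2, 0, 1, 2
combined = pd.concat([part_a, part_b])
has_unique_index = combined.index.is_unique

# Sampling one lab == 0 row and dropping its label also removes
# the other row that carries the same label
to_drop = combined[combined["lab"].eq(0)].sample(1).index
after_drop = combined.drop(to_drop)

# With a fresh 0..n-1 index, each label identifies exactly one row
fixed = combined.reset_index(drop=True)
to_drop_fixed = fixed[fixed["lab"].eq(0)].sample(1).index
fixed_after = fixed.drop(to_drop_fixed)

print(has_unique_index, len(combined), len(after_drop), len(fixed_after))
```

Checking df.index.is_unique before calling drop is a quick way to spot this situation. Passing ignore_index=True to pd.concat avoids the duplicate labels in the first place.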

2 answers

solution1
0 2020-04-16 13:37:02

solution2
0 ACCEPTED 2020-04-16 14:04:48
