
Odd dropping of pandas rows based on conditions

I use the function:

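def df_proc(df, n):
    # Counts before dropping: lab == 0 is the label being dropped,
    # lab == 1 ('C:') is the control label that should not change.
    print(list(df.lab).count(0))
    print('C:', list(df.lab).count(1))

    # Sample n rows with lab == 0 and drop them by index label.
    df = df.drop(df[df.lab.eq(0)].sample(n).index)

    # Counts after dropping.
    print(list(df.lab).count(0))
    print('C:', list(df.lab).count(1))

    return df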

I use it to drop pandas rows that meet a condition (df.lab == 0). This works fine on a small df (e.g. n = 100), but when I increase the number of rows in the df something odd happens: the counts of the other labels (!= 0) also begin to decrease, as if they were affected by the condition.

For example:

# dummy example:
import random
import pandas as pd

list2 = [random.randrange(0, 6, 1) for i in range(1500000)]
list1 = [random.randrange(0, 100, 1) for i in range(1500000)]
dft = pd.DataFrame(list(zip(list1, list2)), columns=['A', 'lab'])
dftest = df_proc(dft, 100000)

gives...

249797
C: 249585
149797
C: 249585

But when I run this on my actual df:

dftest = df_proc(S1,100000)

I get a change in my control labels which is weird.

467110
C: 70434
260616
C: 49395

I'm not sure where the error could have come from. I have tried using frac and df.query('lab == 0') but still run into the same problem. The other thing I noticed is that with small n the control labels are unchanged; they only change when I increase n.

dftest = df_proc(S1,1)

gives:

467110
C: 70434
467107
C: 70434

This doesn't add up, as 3 rows have been removed, not 1.

If it's only about filtering, why not use:

dft = dft[dft['lab'] != 0]

This will filter out all rows with lab == 0.

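The error was that drop removes rows by index label. My df was a concatenation of several dataframes, so the same index labels appeared more than once, and dropping one sampled label removed every row that shared it, whatever its lab value. I had to use reset_index before dropping to overcome the problem.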
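To make the cause concrete, here is a minimal sketch of the duplicate-index effect and the reset_index fix. The frames part_a and part_b and their values are made up for illustration (they are not the S1 data from the question), and random_state is only set to make the run repeatable.

```python
import pandas as pd

# Two hypothetical frames, each with the default 0..2 index.
part_a = pd.DataFrame({"A": [10, 11, 12], "lab": [0, 1, 0]})
part_b = pd.DataFrame({"A": [20, 21, 22], "lab": [1, 0, 1]})

# A plain concat keeps the original labels, so 0, 1 and 2 each appear twice.
combined = pd.concat([part_a, part_b])
print(combined.index.is_unique)  # False

# Sample one lab == 0 row and drop it by label: drop() removes every row
# carrying that label, so a lab == 1 row disappears as well.
to_drop = combined[combined.lab.eq(0)].sample(1, random_state=0).index
dropped_dup = combined.drop(to_drop)
print(len(combined) - len(dropped_dup))  # 2 rows removed instead of 1

# Fix: give every row a unique label first
# (pd.concat(..., ignore_index=True) has the same effect).
combined_fixed = combined.reset_index(drop=True)
to_drop = combined_fixed[combined_fixed.lab.eq(0)].sample(1, random_state=0).index
dropped_fixed = combined_fixed.drop(to_drop)
print(len(combined_fixed) - len(dropped_fixed))  # 1 row removed
```

Checking df.index.is_unique before calling drop is a quick way to tell whether a dataframe is affected.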
