I use the function:
def df_proc(df, n):
    print(list(df.lab).count(0))  # control label to see if it changes after conditional dropping
    print('C:', list(df.lab).count(1))
    df = df.drop(df[df.lab.eq(0)].sample(n).index)
    print(list(df.lab).count(0))
    print('C:', list(df.lab).count(1))
    return df
I use it to drop a random sample of n pandas rows that meet a condition (df.lab == 0). This works fine on a small df (e.g. n = 100); however, when I increase the number of rows in the df something odd happens: the counts of the other labels (!= 0) also begin to decrease, as if they were affected by the condition.
For example:
# dummy example:
import random
import pandas as pd
list2 = [random.randrange(0, 6, 1) for i in range(1500000)]
list1 = [random.randrange(0, 100, 1) for i in range(1500000)]
dft = pd.DataFrame(list(zip(list1, list2)), columns = ['A', 'lab'])
dftest = df_proc(dft, 100000)
gives...
249797
C: 249585
149797
C: 249585
But when I run this on my actual df:
dftest = df_proc(S1,100000)
I get a change in my control labels which is weird.
467110
C: 70434
260616
C: 49395
I'm not sure where the error could have come from. I have tried using frac and df.query('lab == 0') but still run into the same error. The other thing I noticed is that with small n the control labels are unchanged; it's only when I increase n that they change.
dftest = df_proc(S1,1)
gives:
467110
C: 70434
467107
C: 70434
Which doesn't add up, as 3 samples have been removed, not 1.
If it's only about filtering, why not use:
dft = dft[dft['lab'] != 0]
This will filter out all rows with lab=0.
The error was that drop eliminates rows based on index labels; however, my df was a concatenation of several dataframes, so the same labels appeared more than once and each dropped label removed every row that shared it. I had to use reset_index to overcome the problem.
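The behaviour described above can be reproduced on a tiny scale. The following is only a sketch: the frames part1 and part2 and their values are made up for illustration, and a fixed pick of the first lab == 0 row stands in for sample(n) so that the result is repeatable.

```python
import pandas as pd

# Two hypothetical frames, each carrying the default index 0..3.
part1 = pd.DataFrame({"A": [10, 11, 12, 13], "lab": [0, 1, 0, 1]})
part2 = pd.DataFrame({"A": [20, 21, 22, 23], "lab": [1, 0, 1, 0]})

# A plain concat keeps the original labels: 0,1,2,3,0,1,2,3.
dup = pd.concat([part1, part2])
print(dup.index.is_unique)  # the index now contains repeated labels

# Pick one row with lab == 0 (stand-in for .sample(1)) and drop its label.
target = dup[dup.lab.eq(0)].index[:1]
dropped = dup.drop(target)
# drop() removes every row carrying that label, including a lab == 1 row.
print(len(dup) - len(dropped), (dropped.lab == 1).sum())

# Same steps after giving every row its own label.
clean = dup.reset_index(drop=True)
target = clean[clean.lab.eq(0)].index[:1]
fixed = clean.drop(target)
print(len(clean) - len(fixed), (fixed.lab == 1).sum())
```

Passing ignore_index=True to pd.concat should have the same effect as calling reset_index(drop=True) afterwards, and checking df.index.is_unique before dropping is a quick way to spot the issue.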