简体   繁体   中英

Filter pandas dataframe by quantile based on the value of another column

I'm having trouble applying some filter to my dataset, can't get my head around on how to do it. I have a dataframe that consist of 7 columns. One column, participant_id that identifies a participant (numerical value unique to each participant), a second column that indicates to which group each partipant belongs to ( group ), a third column called trial that identifies the trial (each participant performs multiple trials of a task) and four columns var 1 , var 2 , var 3 , var 4 (a numerical value corresponding to four variables recorded of each trial). There are participants with 100+ trials, and participants with around 50 trials.

Short example:

       participant_id        group         trial       var1        var2       var3        var4        
         189                   A             1      -0.231046    0.245615  -0.581238   -0.593562  
         189                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             3      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             4      -0.231046    0.245615  -0.581238   -0.593562
         345                   B             1         NaN       0.245615  -0.581238   -0.593562
         345                   B            378     -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         432                   B            517     -0.231046    0.245615     NaN      -0.593562
         432                   B             2      -0.231046    0.245615     NaN      -0.593562
         432                   B            333     -0.231046    0.245615  -0.581238   -0.593562

Example for 4 participants. Note that the trial number is not unique, as two different participants can have a trial '1', and the same participant also can have multiple trial '1'. (Don't mind that the value of the var columns is the same for every row, in the real dataset that's not the case). The real dataset is 10000 rows.

What I need to do is to keep the top n quantile trials of each unique participant based on the value of var1 . I was thinking of using the pandas.quantile(n) function, but I don't know how to tell it that it should analyze each participant separately, and not the whole row to get the quantile.

Any help is appreciated.

  • apply() on a group is passed a Dataframe of each group
  • have provided example using lambda as well as stand alone function
  • sample data as noted does not have any feature to demonstrate it's function
df = pd.read_csv(io.StringIO("""       participant_id        group         trial       var1        var2       var3        var4        
         189                   A             1      -0.231046    0.245615  -0.581238   -0.593562  
         189                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             3      -0.231046    0.245615  -0.581238   -0.593562
         189                   A             4      -0.231046    0.245615  -0.581238   -0.593562
         345                   B             1         NaN       0.245615  -0.581238   -0.593562
         345                   B            378     -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             1      -0.231046    0.245615  -0.581238   -0.593562
         227                   A             2      -0.231046    0.245615  -0.581238   -0.593562
         432                   B            517     -0.231046    0.245615     NaN      -0.593562
         432                   B             2      -0.231046    0.245615     NaN      -0.593562
         432                   B            333     -0.231046    0.245615  -0.581238   -0.593562
"""), sep="\s+")

def f(d):
    return d.quantile()

df.groupby("participant_id", as_index=False).apply(f)
df.groupby("participant_id", as_index=False).apply(lambda d: d.quantile())

updated

  • question was how to utilise pandas quantile capabilities within groups, answered by first part
  • this specifically shows how to use this approach to filter to rows within quantiles
  • have generated a random dataset that is same shape for demonstration purposes
# generate a useful dataset to analyse...
s = 100
df = pd.DataFrame({"participant_id":np.random.choice([189,227,345,432],s),
             "trial":np.random.randint(1,6,s),
             "group":np.random.choice(["A","B"],s),
              "var1":np.random.rand(s),
              "var2":np.random.rand(s),
              "var3":np.random.rand(s),
              "var4":np.random.rand(s),
             })

# split into 10 quantile bins and take 0th bin
# split into 10 quantile bins and take 0th bin
df.groupby("participant_id").apply(
    lambda d: d.loc[pd.qcut(d.var1, q=10, retbins=False, labels=False).le(0)]).droplevel(0)


participant_id trial group var1 var2 var3 var4
13 189 3 A 0.0273875 0.87134 0.555792 0.67094
85 189 1 A 0.0106758 0.352578 0.481009 0.910989
91 189 4 A 0.00835706 0.644102 0.990459 0.816669
4 227 5 B 0.10132 0.870446 0.389972 0.313782
71 227 4 B 0.0221867 0.17566 0.659024 0.910838
74 227 1 A 0.0762526 0.458995 0.492384 0.556408
28 345 5 A 0.130674 0.0336628 0.0429884 0.799307
82 345 1 B 0.234522 0.371108 0.451911 0.54528
93 345 4 B 0.272915 0.594262 0.392285 0.56374
16 432 2 B 0.0403964 0.880132 0.45438 0.0466626
18 432 5 A 0.0884496 0.304541 0.969059 0.949315
22 432 2 B 0.115796 0.0371306 0.631284 0.537881
55 432 2 B 0.102859 0.0416843 0.761466 0.56438

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM