Performing a T-Test on a Multiindex Pandas Dataframe

I'm looking to perform a T-test on various pieces of data in a pandas DataFrame.

I have a DataFrame organized like this:

import pandas as pd

df = pd.DataFrame({'a': {('0hr', '0.01um', 0): 12,
      ('0hr', '0.01um', 1): 10,
      ('0hr', '0.1um', 0): 8,
      ('0hr', '0.1um', 1): 6,
      ('0hr', 'Control', 0): 4,
      ('0hr', 'Control', 1): 2,
      ('24hr', '0.01um', 0): 18,
      ('24hr', '0.01um', 1): 15,
      ('24hr', '0.1um', 0): 12,
      ('24hr', '0.1um', 1): 9,
      ('24hr', 'Control', 0): 6,
      ('24hr', 'Control', 1): 3},
     'b': {('0hr', '0.01um', 0): 42,
      ('0hr', '0.01um', 1): 35,
      ('0hr', '0.1um', 0): 28,
      ('0hr', '0.1um', 1): 21,
      ('0hr', 'Control', 0): 14,
      ('0hr', 'Control', 1): 7,
      ('24hr', '0.01um', 0): 30,
      ('24hr', '0.01um', 1): 25,
      ('24hr', '0.1um', 0): 20,
      ('24hr', '0.1um', 1): 15,
      ('24hr', 'Control', 0): 10,
      ('24hr', 'Control', 1): 5}})

print(df)

                     a   b
    0hr  0.01um  0  12  42
                 1  10  35
         0.1um   0   8  28
                 1   6  21
         Control 0   4  14
                 1   2   7
    24hr 0.01um  0  18  30
                 1  15  25
         0.1um   0  12  20
                 1   9  15
         Control 0   6  10
                 1   3   5

For each column (a, b, etc.) I'd like to perform a t-test comparing the Control for a given time frame to the other tests in that time frame.

For example:

from scipy import stats

t, prob = stats.ttest_ind(df.loc['0hr'].loc['Control'], df.loc['0hr'].loc['Control'], axis=1, equal_var=True)
t, prob = stats.ttest_ind(df.loc['0hr'].loc['Control'], df.loc['0hr'].loc['0.01um'], axis=1, equal_var=True)
t, prob = stats.ttest_ind(df.loc['0hr'].loc['Control'], df.loc['0hr'].loc['0.1um'], axis=1, equal_var=True)
t, prob = stats.ttest_ind(df.loc['24hr'].loc['Control'], df.loc['24hr'].loc['Control'], axis=1, equal_var=True)
t, prob = stats.ttest_ind(df.loc['24hr'].loc['Control'], df.loc['24hr'].loc['0.01um'], axis=1, equal_var=True)
t, prob = stats.ttest_ind(df.loc['24hr'].loc['Control'], df.loc['24hr'].loc['0.1um'], axis=1, equal_var=True)

I've been trying to do this with df.apply, but I'm not sure what the right syntax is. I'd like to return the results in a new DataFrame structured like:

results = pd.DataFrame({'a': {('0hr', '0.01um', 't'): '-',
  ('0hr', '0.01um', 'prob'): '-',
  ('0hr', '0.1um', 't'): '-',
  ('0hr', '0.1um', 'prob'): '-',
  ('0hr', 'Control', 't'): '-',
  ('0hr', 'Control', 'prob'): '-',
  ('24hr', '0.01um', 't'): '-',
  ('24hr', '0.01um', 'prob'): '-',
  ('24hr', '0.1um', 't'): '-',
  ('24hr', '0.1um', 'prob'): '-',
  ('24hr', 'Control', 't'): '-',
  ('24hr', 'Control', 'prob'): '-'},
 'b': {('0hr', '0.01um', 't'): '-',
  ('0hr', '0.01um', 'prob'): '-',
  ('0hr', '0.1um', 't'): '-',
  ('0hr', '0.1um', 'prob'): '-',
  ('0hr', 'Control', 't'): '-',
  ('0hr', 'Control', 'prob'): '-',
  ('24hr', '0.01um', 't'): '-',
  ('24hr', '0.01um', 'prob'): '-',
  ('24hr', '0.1um', 't'): '-',
  ('24hr', '0.1um', 'prob'): '-',
  ('24hr', 'Control', 't'): '-',
  ('24hr', 'Control', 'prob'): '-'}})

I'm not completely sure I've understood the situation, but I think this is one way to handle the MultiIndex:

import pandas as pd
from scipy import stats

# One 't' row and one 'p' row for every (time, condition) pair.
times = df.index.get_level_values(0).unique()
conditions = df.index.get_level_values(1).unique()
index = pd.MultiIndex.from_product([times, conditions, ['t', 'p']])
result = pd.DataFrame(columns=['a', 'b'], index=index)

for time in times:
    control = df.loc[time].loc['Control']
    for condition in conditions.drop('Control'):
        # The positional 1 in the question is the axis argument; spelled out here.
        t, p = stats.ttest_ind(control, df.loc[time].loc[condition], axis=1, equal_var=True)
        result.loc[(time, condition, 't'), :] = t
        result.loc[(time, condition, 'p'), :] = p
print(result)

And the result:

                        a           b
0hr  Control t        NaN         NaN
             p        NaN         NaN
     0.01um  t -0.6706134   -1.412036
             p  0.5715365   0.2934382
     0.1um   t -0.8049845    -1.13842
             p  0.5053153   0.3729403
24hr Control t        NaN         NaN
             p        NaN         NaN
     0.01um  t  -2.529822   -3.137858
             p  0.1271284  0.08831539
     0.1um   t  -1.788854   -2.529822
             p  0.2155355   0.1271284

You could easily fill in the Control lines if you needed to, but as you say the results are predictable.
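
For the Control rows, comparing a sample with itself gives a zero difference in means, so t is 0 and the two-sided p-value is 1, assuming the sample has nonzero variance. Here is a quick check on the 0hr Control values from the question, followed by a hypothetical fill step on a small result frame built just for this example:

```python
import numpy as np
import pandas as pd
from scipy import stats

# 0hr Control values from the question: rows are replicates, columns are a and b.
control = np.array([[4, 14], [2, 7]])

# Same call shape as in the answer: axis=1 with pooled variance.
t, p = stats.ttest_ind(control, control, axis=1, equal_var=True)
print(t, p)

# Hypothetical fill step: write the values into the Control rows of a result frame.
index = pd.MultiIndex.from_product([["0hr"], ["Control"], ["t", "p"]])
result = pd.DataFrame(columns=["a", "b"], index=index)
result.loc[("0hr", "Control", "t"), :] = t
result.loc[("0hr", "Control", "p"), :] = p
print(result)
```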

Hope it helps anyway.
