
Find the value closest to median in a dataframe

I picked up Python a couple of months ago and am new to this forum, so I would appreciate any help. For each unit, I want to find the value closest to the median.

  1. To get the median for each unit, I am using groupby and median().
  2. Then I subtract the medians from the original dataframe.
  3. I use abs() and idxmin() to get the smallest delta. I end up with another dataframe that holds, for each column, the index of the value closest to the median. How do I use that index to get the actual value?
Unit    Test1   Test2   Test3
Unit1   0.254279388 0.010388754 0.820704593
Unit1   0.957139807 0.207681463 0.738428693
Unit1   0.043462803 0.154220478 0.606568744
Unit2   0.044308884 0.134817932 0.697317637
Unit2   0.244895686 0.909262442 0.153881824
Unit3   0.368147792 0.735655648 0.200679595
Unit3   0.30457518  0.929519313 0.823938759
Unit3   0.537633836 0.661168043 0.736937724
Unit3   0.410137495 0.567494043 0.68300754
Unit3   0.525483757 0.556830631 0.988314575

should become

Unit    Test1   Test2   Test3
Unit1   0.254279388 0.154220478 0.738428693
Unit2   0.144602285 0.522040187 0.425599731
Unit3   0.410137495 0.661168043 0.736937724

Here is a snippet of the code. Each column should have its own index, but iloc uses the first index for all columns.

DATA_MEDIAN = DATA.groupby('Unit').median()

DATA_INDEX = (DATA.set_index(['Unit']) - DATA_MEDIAN).abs().reset_index().groupby('Unit').idxmin()

DATA_INDEX.reset_index(inplace=True)

DATA_CLOSEST = DATA.iloc[DATA_INDEX.index]
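
In the snippet above, DATA_INDEX.index is the frame's own row index (0, 1, 2 after reset_index), not the row labels stored in its cells, which is why every column ends up with the same rows. One possible way to use the per-column labels is to look up each column separately. The following is only a minimal sketch on made-up data; the names `cols`, `dist`, `idx` and `closest` are illustrative and do not come from the question.

```python
import pandas as pd

# Illustrative data (made up for this sketch, not the question's data).
df = pd.DataFrame({
    "Unit": ["A", "A", "A", "B", "B", "B"],
    "Test1": [1.0, 5.0, 2.0, 10.0, 30.0, 20.0],
    "Test2": [3.0, 1.0, 9.0, 8.0, 6.0, 7.0],
})
cols = ["Test1", "Test2"]

# Absolute distance of every value from its unit's median, column by column.
medians = df.groupby("Unit")[cols].transform("median")
dist = (df[cols] - medians).abs()

# For each unit and column: the row label of the smallest distance.
idx = dist.groupby(df["Unit"]).idxmin()

# Look each column up with its own labels instead of one shared index.
closest = pd.DataFrame(
    {c: df.loc[idx[c], c].to_numpy() for c in cols},
    index=idx.index,
)
print(closest)
```

Here `idx` has one row per unit and one column per test, and each cell is a row label of `df`, so every column is resolved against its own labels.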

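Here's a solution. Note that it measures the distance from the group mean; to measure it from the median, as in the question, replace "mean" with "median" in the transform call. For units with only two rows, both values are equidistant from the mean (and from the median), so which of the two is returned is arbitrary.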

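import pandas as pd

t = df.melt(id_vars="Unit")
# Absolute distance of each value from the mean of its (Unit, test) group.
group_mean = t.groupby(["Unit", "variable"])["value"].transform("mean")
t["distance_from_mean"] = (t["value"] - group_mean).abs()

# Keep the row with the smallest distance in each group.
t = t.loc[t.groupby(["Unit", "variable"])["distance_from_mean"].idxmin()]

res = pd.pivot_table(t, columns="variable", index="Unit", values="value")
print(res)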

The output is:

variable     Test1     Test2     Test3
Unit                                  
Unit1     0.254279  0.154220  0.738429
Unit2     0.044309  0.134818  0.153882
Unit3     0.410137  0.661168  0.683008
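
Since the question asks for the median rather than the mean, here is a hedged sketch of the same melt-based idea with the center statistic made configurable. The helper name `closest_to_center` and its `center` parameter are assumptions made for this example, and the data is invented so that the mean and the median pick different values.

```python
import pandas as pd

# Illustrative data (made up): in unit "B" the mean and median disagree.
df = pd.DataFrame({
    "Unit": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "Test1": [1.0, 5.0, 2.0, 10.0, 40.0, 20.0, 30.0, 100.0],
    "Test2": [3.0, 1.0, 9.0, 7.0, 8.0, 6.0, 50.0, 5.0],
})

def closest_to_center(frame, key="Unit", center="median"):
    """Per group and column, return the observed value nearest the group's center."""
    long = frame.melt(id_vars=key)
    # Center ("median" or "mean") of each (group, column) pair, aligned to rows.
    group_center = long.groupby([key, "variable"])["value"].transform(center)
    long["distance"] = (long["value"] - group_center).abs()
    # idxmin gives the row label of the smallest distance in each group.
    idx = long.groupby([key, "variable"])["distance"].idxmin()
    return long.loc[idx].pivot(index=key, columns="variable", values="value")

by_median = closest_to_center(df)                 # distance from the median
by_mean = closest_to_center(df, center="mean")    # distance from the mean
print(by_median)
print(by_mean)
```

With this data, unit "B" in Test1 resolves to 30.0 by median but 40.0 by mean, which shows why the choice of statistic matters when a group contains outliers.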

If some of the units only have NaN values for a given test, the code needs to be slightly modified. See below (with an example of such a dataframe):

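import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit2"] * 2 + ["Unit3"] * 5, 
    "Test1": range(10), 
    "Test2": range(10, 20), 
    "Test3": range(20,30)
})
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0.
df.loc[3:4, "Test2"] = np.nan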
print(df)

==>
    Unit  Test1  Test2  Test3
0  Unit1      0   10.0     20
...
3  Unit2      3    NaN     23
4  Unit2      4    NaN     24
5  Unit3      5   15.0     25
...

The code you're looking for:

t = df.melt(id_vars="Unit")
group_mean = t.groupby(["Unit", "variable"])["value"].transform("mean")
t["distance_from_mean"] = (t["value"] - group_mean).abs()

indices = t.groupby(["Unit", "variable"])["distance_from_mean"].idxmin()
# All-NaN groups have no closest row; drop them and restore integer labels.
indices = indices.dropna().astype(int)

t = t.loc[indices]

res = pd.pivot_table(t, columns="variable", index="Unit", values="value")
print(res)

==>

variable  Test1  Test2  Test3
Unit                         
Unit1       1.0   11.0   21.0
Unit2       3.0    NaN   23.0
Unit3       7.0   17.0   27.0
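
An alternative that may be worth considering is to drop the NaN rows before grouping, so that idxmin never sees an all-NaN group (depending on the pandas version, that case may warn or raise rather than return NaN). The sketch below is an assumption-laden variant, not the answer's own code: it builds Test2 as floats up front and uses pivot instead of pivot_table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit2"] * 2 + ["Unit3"] * 5,
    "Test1": range(10),
    "Test2": np.arange(10, 20, dtype=float),  # float so NaN can be stored
    "Test3": range(20, 30),
})
df.loc[3:4, "Test2"] = np.nan

# Drop missing values first: the (Unit2, Test2) group simply disappears.
t = df.melt(id_vars="Unit").dropna(subset=["value"])
group_mean = t.groupby(["Unit", "variable"])["value"].transform("mean")
t = t.assign(distance_from_mean=(t["value"] - group_mean).abs())

idx = t.groupby(["Unit", "variable"])["distance_from_mean"].idxmin()

# pivot fills the missing (Unit2, Test2) cell with NaN automatically.
res = t.loc[idx].pivot(index="Unit", columns="variable", values="value")
print(res)
```

Because Unit2 still has rows for Test1 and Test3, it stays in the result and only its Test2 cell is NaN.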
