I picked up Python a couple of months ago and am new to this forum, so I would appreciate any help. For each unit, I want to find the value in each test column that is closest to the median of that unit's values. That is, I want to go from
Unit Test1 Test2 Test3
Unit1 0.254279388 0.010388754 0.820704593
Unit1 0.957139807 0.207681463 0.738428693
Unit1 0.043462803 0.154220478 0.606568744
Unit2 0.044308884 0.134817932 0.697317637
Unit2 0.244895686 0.909262442 0.153881824
Unit3 0.368147792 0.735655648 0.200679595
Unit3 0.30457518 0.929519313 0.823938759
Unit3 0.537633836 0.661168043 0.736937724
Unit3 0.410137495 0.567494043 0.68300754
Unit3 0.525483757 0.556830631 0.988314575
to
Unit Test1 Test2 Test3
Unit1 0.254279388 0.154220478 0.738428693
Unit2 0.144602285 0.522040187 0.425599731
Unit3 0.410137495 0.661168043 0.736937724
Here is a snippet of my code. Each column should have its own index, but iloc uses the first index for all columns:
DATA_MEDIAN = DATA.groupby('Unit').median()
DATA_INDEX = (DATA.set_index(['Unit']) - DATA_MEDIAN).abs().reset_index().groupby('Unit').idxmin()
DATA_INDEX.reset_index(inplace=True)
DATA_CLOSEST = DATA.iloc[DATA_INDEX.index]
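One possible way to give each column its own row index, as a minimal sketch rather than a definitive fix: compute the distances in wide form, take `idxmin` per unit, then look each column up with its own labels. The names `tests`, `dist`, `idx` and `closest` are illustrative, and the frame reuses only the Unit1 and Unit3 rows from the question (an assumption made to avoid the two-row tie in Unit2).

```python
import pandas as pd

# Unit1 and Unit3 rows copied from the question; Unit2 is left out because
# with only two rows both values are equally close to the median.
df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit3"] * 5,
    "Test1": [0.254279388, 0.957139807, 0.043462803, 0.368147792,
              0.30457518, 0.537633836, 0.410137495, 0.525483757],
    "Test2": [0.010388754, 0.207681463, 0.154220478, 0.735655648,
              0.929519313, 0.661168043, 0.567494043, 0.556830631],
    "Test3": [0.820704593, 0.738428693, 0.606568744, 0.200679595,
              0.823938759, 0.736937724, 0.68300754, 0.988314575],
})
tests = ["Test1", "Test2", "Test3"]

# Absolute distance of every value from its own unit's median, row-aligned with df.
dist = (df[tests] - df.groupby("Unit")[tests].transform("median")).abs()

# One row label per (unit, column): the row where that column is closest to the median.
idx = dist.groupby(df["Unit"]).idxmin()

# Look each column up with its own row labels instead of one shared index.
closest = pd.DataFrame(
    {col: df.loc[idx[col].to_numpy(), col].to_numpy() for col in tests},
    index=idx.index,
)
print(closest)
```

Because `idx` holds a separate row label for every column, no single `iloc` call is needed; each column is selected independently.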
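Here's a solution. Note that it measures distance from the group mean rather than the median; passing "median" instead of "mean" to transform gives the median-based version. For units with only two rows, both values are equally far from the mean (and from the median), so which one is selected is arbitrary.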
t = df.melt(id_vars="Unit")
t["distance_from_mean"] = t.groupby(["Unit", "variable"]).transform("mean").subtract(t.value, axis=0).abs()
t = t.loc[t.groupby(["Unit", "variable"], as_index=False)["distance_from_mean"].idxmin()]
res = pd.pivot_table(t, columns="variable", index = "Unit", values="value")
print(res)
The output is:
variable Test1 Test2 Test3
Unit
Unit1 0.254279 0.154220 0.738429
Unit2 0.044309 0.134818 0.153882
Unit3 0.410137 0.661168 0.683008
If some of the units only have NaN values for a given test, the code needs to be slightly modified. See below (with an example of such a dataframe):
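import numpy as np
import pandas as pd

df = pd.DataFrame({
"Unit": ["Unit1"] * 3 + ["Unit2"] * 2 + ["Unit3"] * 5,
"Test1": range(10),
"Test2": range(10, 20),
"Test3": range(20, 30)
})
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
df.loc[3:4, "Test2"] = np.nan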
print(df)
==>
Unit Test1 Test2 Test3
0 Unit1 0 10.0 20
...
3 Unit2 3 NaN 23
4 Unit2 4 NaN 24
5 Unit3 5 15.0 25
...
The code you're looking for:
t = df.melt(id_vars="Unit")
t["distance_from_mean"] = t.groupby(["Unit", "variable"]).transform("mean").subtract(t.value, axis=0).abs()
indices = t.groupby(["Unit", "variable"], as_index=False)["distance_from_mean"].idxmin()
indices.dropna(inplace = True)
t = t.loc[indices]
res = pd.pivot_table(t, columns="variable", index = "Unit", values="value")
print(res)
==>
variable Test1 Test2 Test3
Unit
Unit1 1.0 11.0 21.0
Unit2 3.0 NaN 23.0
Unit3 7.0 17.0 27.0
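As an alternative sketch (not the code from the answer above), the all-NaN case can also be sidestepped by dropping missing values before grouping, so that `idxmin` never sees a group without valid values. This version uses the median, as the question asks, and the names `group_median` and `distance` are illustrative. The example frame is built directly with NaN entries, which is an assumption made to keep the block self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit2"] * 2 + ["Unit3"] * 5,
    "Test1": range(10),
    "Test2": [10, 11, 12, np.nan, np.nan, 15, 16, 17, 18, 19],
    "Test3": range(20, 30),
})

# Long format, with missing measurements removed before any grouping.
t = df.melt(id_vars="Unit").dropna(subset=["value"])

# Distance of each value from the median of its (Unit, variable) group.
group_median = t.groupby(["Unit", "variable"])["value"].transform("median")
distance = (t["value"] - group_median).abs()

# Row label of the closest value in each group; groups that held only NaN
# were dropped above, so every remaining group has a valid minimum.
indices = distance.groupby([t["Unit"], t["variable"]]).idxmin()

# Back to wide format; the missing (Unit2, Test2) combination becomes NaN.
res = t.loc[indices.to_numpy()].pivot(index="Unit", columns="variable", values="value")
print(res)
```

For Unit2, which has only two rows, both Test1 and Test3 values are tied, so which of the two is returned should not be relied on.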