Find the value closest to median in a dataframe
I picked up Python a couple of months ago and am new to this forum. I would appreciate any help. I want to find the value closest to the median within each unit, going from
Unit Test1 Test2 Test3
Unit1 0.254279388 0.010388754 0.820704593
Unit1 0.957139807 0.207681463 0.738428693
Unit1 0.043462803 0.154220478 0.606568744
Unit2 0.044308884 0.134817932 0.697317637
Unit2 0.244895686 0.909262442 0.153881824
Unit3 0.368147792 0.735655648 0.200679595
Unit3 0.30457518 0.929519313 0.823938759
Unit3 0.537633836 0.661168043 0.736937724
Unit3 0.410137495 0.567494043 0.68300754
Unit3 0.525483757 0.556830631 0.988314575
to
Unit Test1 Test2 Test3
Unit1 0.254279388 0.154220478 0.738428693
Unit2 0.144602285 0.522040187 0.425599731
Unit3 0.410137495 0.661168043 0.736937724
Here is the snippet of the code. Each column should have its own index, but iloc uses the first index for all columns.
DATA_MEDIAN = DATA.groupby('Unit').median()
DATA_INDEX = (DATA.set_index(['Unit']) - DATA_MEDIAN).abs().reset_index().groupby('Unit').idxmin()
DATA_INDEX.reset_index(inplace=True)
DATA_CLOSEST = DATA.iloc[DATA_INDEX.index]
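Here's a solution. Note that it measures the distance from the group mean rather than the median; passing "median" instead of "mean" to transform gives the median-based version. Also note that for units with only two rows, both values are equally far from the mean, so the choice between them is arbitrary.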
t = df.melt(id_vars="Unit")
t["distance_from_mean"] = t.groupby(["Unit", "variable"]).transform("mean").subtract(t.value, axis=0).abs()
t = t.loc[t.groupby(["Unit", "variable"], as_index=False)["distance_from_mean"].idxmin()]
res = pd.pivot_table(t, columns="variable", index = "Unit", values="value")
print(res)
The output is:
variable Test1 Test2 Test3
Unit
Unit1 0.254279 0.154220 0.738429
Unit2 0.044309 0.134818 0.153882
Unit3 0.410137 0.661168 0.683008
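Since the question asks for the median, here is a minimal sketch of an alternative that picks the closest value separately for every column, so each column gets its own row index. The helper name `closest_to_median` and its inner function `pick` are made up for illustration; the data is the sample from the question.

```python
import pandas as pd

def closest_to_median(frame, group_col="Unit"):
    """For each group and each column, return the observed value nearest the group median."""
    def pick(col):
        col = col.dropna()                 # ignore missing values
        if col.empty:
            return float("nan")            # nothing to choose from in this group
        # idxmin returns the label of the first row with the smallest distance
        return col.loc[(col - col.median()).abs().idxmin()]
    # agg calls pick once per (group, column), so every column is resolved independently
    return frame.groupby(group_col).agg(pick)

df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit2"] * 2 + ["Unit3"] * 5,
    "Test1": [0.254279388, 0.957139807, 0.043462803, 0.044308884, 0.244895686,
              0.368147792, 0.30457518, 0.537633836, 0.410137495, 0.525483757],
    "Test2": [0.010388754, 0.207681463, 0.154220478, 0.134817932, 0.909262442,
              0.735655648, 0.929519313, 0.661168043, 0.567494043, 0.556830631],
    "Test3": [0.820704593, 0.738428693, 0.606568744, 0.697317637, 0.153881824,
              0.200679595, 0.823938759, 0.736937724, 0.68300754, 0.988314575],
})

res_median = closest_to_median(df)
print(res_median)
```

For groups with an odd number of rows the result is simply the median itself. For two-row groups such as Unit2, both values are equidistant from the median, so which one is returned depends on floating-point rounding and row order.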
If some of the units only have NaN values for a given test, the code needs to be slightly modified. See below (with an example of such a dataframe):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit2"] * 2 + ["Unit3"] * 5,
    "Test1": range(10),
    "Test2": range(10, 20),
    "Test3": range(20, 30)
})
df.loc[3:4, "Test2"] = np.nan  # np.NaN was removed in NumPy 2.0
print(df)
==>
Unit Test1 Test2 Test3
0 Unit1 0 10.0 20
...
3 Unit2 3 NaN 23
4 Unit2 4 NaN 24
5 Unit3 5 15.0 25
...
The code you're looking for:
t = df.melt(id_vars="Unit")
t["distance_from_mean"] = t.groupby(["Unit", "variable"]).transform("mean").subtract(t.value, axis=0).abs()
indices = t.groupby(["Unit", "variable"], as_index=False)["distance_from_mean"].idxmin()
indices.dropna(inplace = True)
t = t.loc[indices]
res = pd.pivot_table(t, columns="variable", index = "Unit", values="value")
print(res)
==>
variable Test1 Test2 Test3
Unit
Unit1 1.0 11.0 21.0
Unit2 3.0 NaN 23.0
Unit3 7.0 17.0 27.0
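The same melt-based idea can also be sketched with the median and with the NaN rows dropped before grouping, so that idxmin never sees an all-NaN group. This is only an illustration under those assumptions, not a replacement for the code above; the variable names `center`, `idx` and `res_nan` are chosen here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit2"] * 2 + ["Unit3"] * 5,
    "Test1": range(10),
    "Test2": range(10, 20),
    "Test3": range(20, 30),
})
df.loc[3:4, "Test2"] = np.nan  # Unit2 has no Test2 measurements at all

# Long format: one row per (Unit, variable, value)
t = df.melt(id_vars="Unit")

# Group median broadcast back to every row, then the absolute distance to it
center = t.groupby(["Unit", "variable"])["value"].transform("median")
t["distance"] = (t["value"] - center).abs()

# Drop rows without a distance first; all-NaN groups simply disappear here
idx = t.dropna(subset=["distance"]).groupby(["Unit", "variable"])["distance"].idxmin()

# Back to wide format; groups that disappeared show up as NaN
res_nan = t.loc[idx].pivot(index="Unit", columns="variable", values="value")
print(res_nan)
```

On this sample the median and the mean coincide in every group, so the result matches the table above, with NaN for Unit2 / Test2.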