
Find the value closest to median in a dataframe

I picked up Python a couple of months ago and am new to this forum. I'd appreciate any help. I want to find the value closest to the median.

  1. To get the median for each unit, I am using groupby and median().
  2. Then I take the difference from the original dataframe.
  3. I use abs() and idxmin() to get the smallest delta. I end up with another dataframe that holds the index of the value closest to the median. How do I use that index to get the actual value?
Unit    Test1   Test2   Test3
Unit1   0.254279388 0.010388754 0.820704593
Unit1   0.957139807 0.207681463 0.738428693
Unit1   0.043462803 0.154220478 0.606568744
Unit2   0.044308884 0.134817932 0.697317637
Unit2   0.244895686 0.909262442 0.153881824
Unit3   0.368147792 0.735655648 0.200679595
Unit3   0.30457518  0.929519313 0.823938759
Unit3   0.537633836 0.661168043 0.736937724
Unit3   0.410137495 0.567494043 0.68300754
Unit3   0.525483757 0.556830631 0.988314575

to

Unit    Test1   Test2   Test3
Unit1   0.254279388 0.154220478 0.738428693
Unit2   0.144602285 0.522040187 0.425599731
Unit3   0.410137495 0.661168043 0.736937724

Here is the snippet of the code. Each column should have its own index, but iloc uses the first index for all columns.

DATA_MEDIAN = DATA.groupby('Unit').median()

DATA_INDEX = (DATA.set_index(['Unit']) - DATA_MEDIAN).abs().reset_index().groupby('Unit').idxmin()

DATA_INDEX.reset_index(inplace=True)

DATA_CLOSEST = DATA.iloc[DATA_INDEX.index]

Here's a solution. Note that it measures the distance from the group mean. For units with only two rows, both values are equally far from the mean, so the choice between them is arbitrary.

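import pandas as pd

# Reshape to long format: one row per (Unit, test, value)
t = df.melt(id_vars="Unit")
# Absolute distance of each value from the mean of its (Unit, test) group
t["distance_from_mean"] = (t.groupby(["Unit", "variable"])["value"].transform("mean") - t["value"]).abs()

# For each (Unit, test) group, keep the row with the smallest distance
t = t.loc[t.groupby(["Unit", "variable"])["distance_from_mean"].idxmin()]

res = pd.pivot_table(t, columns="variable", index="Unit", values="value")
print(res)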

The output is:

variable     Test1     Test2     Test3
Unit                                  
Unit1     0.254279  0.154220  0.738429
Unit2     0.044309  0.134818  0.153882
Unit3     0.410137  0.661168  0.683008
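
The question asks for the value closest to the median, while the code above measures distance from the mean. The two can disagree (Unit3/Test3 above, for instance). What follows is a minimal sketch of the same melt-based idea with the median as the reference point. It is an assumption-laden variant, not the original answer's code: the names `melted`, `valid` and `distance_from_median` are made up for illustration, and Unit2 is left out of the sample data because its two rows are a tie.

```python
import pandas as pd

# Sample rows copied from the question (Unit2 omitted: two rows are always a tie).
df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit3"] * 5,
    "Test1": [0.254279388, 0.957139807, 0.043462803,
              0.368147792, 0.30457518, 0.537633836, 0.410137495, 0.525483757],
    "Test2": [0.010388754, 0.207681463, 0.154220478,
              0.735655648, 0.929519313, 0.661168043, 0.567494043, 0.556830631],
    "Test3": [0.820704593, 0.738428693, 0.606568744,
              0.200679595, 0.823938759, 0.736937724, 0.68300754, 0.988314575],
})

# Long format: one row per (Unit, variable, value).
melted = df.melt(id_vars="Unit")

# Median of each (Unit, variable) group, broadcast back to every row.
group_median = melted.groupby(["Unit", "variable"])["value"].transform("median")
melted["distance_from_median"] = (melted["value"] - group_median).abs()

# Ignore rows without a distance (NaN values) before searching for the minimum.
valid = melted.dropna(subset=["distance_from_median"])
closest_rows = valid.loc[
    valid.groupby(["Unit", "variable"])["distance_from_median"].idxmin()
]

# Back to wide format: one row per Unit, one column per test.
res = closest_rows.pivot(index="Unit", columns="variable", values="value")
print(res)
```

With this sample, the Unit1 and Unit3 rows of `res` match the values the question lists as the expected result.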

If some of the units only have NaN values for a given test, the code needs to be slightly modified. See below (with an example of such a dataframe):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit2"] * 2 + ["Unit3"] * 5,
    "Test1": range(10),
    "Test2": range(10, 20),
    "Test3": range(20, 30),
})
# Make Test2 all-NaN for Unit2 (rows 3 and 4)
df.loc[3:4, "Test2"] = np.nan
print(df)

==>
    Unit  Test1  Test2  Test3
0  Unit1      0   10.0     20
...
3  Unit2      3    NaN     23
4  Unit2      4    NaN     24
5  Unit3      5   15.0     25
...

The code you're looking for:

t = df.melt(id_vars="Unit")
t["distance_from_mean"] = (t.groupby(["Unit", "variable"])["value"].transform("mean") - t["value"]).abs()

# Rows in all-NaN groups have no distance; drop them so idxmin only sees valid groups
valid = t.dropna(subset=["distance_from_mean"])
indices = valid.groupby(["Unit", "variable"])["distance_from_mean"].idxmin()

t = t.loc[indices]

res = pd.pivot_table(t, columns="variable", index="Unit", values="value")
print(res)

==>

variable  Test1  Test2  Test3
Unit                         
Unit1       1.0   11.0   21.0
Unit2       3.0    NaN   23.0
Unit3       7.0   17.0   27.0
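
The approach in the question (idxmin on the wide dataframe) can also be finished without melting, by looking up each column with its own row labels instead of passing one index to iloc. The sketch below assumes no group is entirely NaN and uses the median, as the question asks. The names `tests`, `unit_median`, `closest_idx` and `closest` are illustrative, not from the original post, and Unit2 is again left out of the sample data because two rows are a tie.

```python
import pandas as pd

# Sample rows copied from the question (Unit2 omitted: two rows are always a tie).
df = pd.DataFrame({
    "Unit": ["Unit1"] * 3 + ["Unit3"] * 5,
    "Test1": [0.254279388, 0.957139807, 0.043462803,
              0.368147792, 0.30457518, 0.537633836, 0.410137495, 0.525483757],
    "Test2": [0.010388754, 0.207681463, 0.154220478,
              0.735655648, 0.929519313, 0.661168043, 0.567494043, 0.556830631],
    "Test3": [0.820704593, 0.738428693, 0.606568744,
              0.200679595, 0.823938759, 0.736937724, 0.68300754, 0.988314575],
})
tests = ["Test1", "Test2", "Test3"]

# Median of each unit, repeated on every row of that unit (same index as df).
unit_median = df.groupby("Unit")[tests].transform("median")

# Absolute distance of every value from its unit's median.
distance = (df[tests] - unit_median).abs()

# For each unit and each column, the row label of the smallest distance.
closest_idx = distance.groupby(df["Unit"]).idxmin()

# Look up every column with its own row labels.
closest = pd.DataFrame(
    {col: df.loc[closest_idx[col], col].to_numpy() for col in tests},
    index=closest_idx.index,
)
print(closest)
```

The key difference from `DATA.iloc[DATA_INDEX.index]` is that `closest_idx[col]` holds row labels specific to one column, so each column is selected separately with `df.loc`.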
