Pandas - How can I set rules for selecting which duplicates to drop

I have a dataset with values in a column and datetime objects in the index. I want to drop the rows that share the same index (date and time), but I want to apply a rule such as:


I have two values for the same datetime, so I want to pick the one closer to a number X. That number could be, for example, the mean value of the whole dataset.


The dataset looks like this (I reset the index while trying to make this work, because I got an error about having multiple indices):

         index                  kwh
16391   2014-10-26 03:14:59     0.0514139
16392   2014-10-26 03:29:59     0.0323344
16393   2014-10-26 03:29:59     12.3
16394   2014-10-26 03:44:59     0.0595618
16395   2014-10-26 03:59:59     0.0338677

If X (for example, the mean value) is 0.05, then what I want to get back is:

16391   2014-10-26 03:14:59     0.0514139
16392   2014-10-26 03:29:59     0.0323344
16393   2014-10-26 03:44:59     0.0595618
16394   2014-10-26 03:59:59     0.0338677

I have tried using groupby and apply in several different ways, but I can't get it to work. Any help, please?

If you add a dist column to the DataFrame which measures the absolute distance between kwh and X:

X = df['kwh'].mean()
df['dist'] = (df['kwh'] - X).abs()

then you can group by index and build a boolean mask that marks the rows with the minimum dist in each group:

idx = df.groupby(['index'])['dist'].transform(lambda x: x == x.min()).astype(bool)

Then you can select those rows using df.loc:

df.loc[idx]
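
The question mentions resetting the index only to work around an error. As a rough sketch, assuming the original frame has a DatetimeIndex with repeated timestamps and a single kwh column (the frame below is rebuilt by hand from the question's sample, and X is fixed at 0.05 as in the question), the same mask can be built without resetting the index by grouping on the index level:

```python
import pandas as pd

# Sample rows from the question, with the timestamps as a non-unique DatetimeIndex.
ts = pd.to_datetime([
    "2014-10-26 03:14:59",
    "2014-10-26 03:29:59",
    "2014-10-26 03:29:59",
    "2014-10-26 03:44:59",
    "2014-10-26 03:59:59",
])
df = pd.DataFrame(
    {"kwh": [0.0514139, 0.0323344, 12.3, 0.0595618, 0.0338677]},
    index=ts,
)

X = 0.05  # reference value from the question; df["kwh"].mean() is another choice
dist = (df["kwh"] - X).abs()

# For each timestamp, broadcast the group's smallest distance back to every row.
min_dist = dist.groupby(level=0).transform("min")

# Keep only the rows whose distance equals the minimum for their timestamp.
# to_numpy() makes the selection positional, so the repeated labels do not matter.
result = df[(dist == min_dist).to_numpy()]
print(result)
```

Here the distance is kept in a separate Series instead of a dist column, so there is nothing to delete afterwards.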

If data contains (note the duplicate values of kwh for the same index):

         index                  kwh
16391   2014-10-26 03:14:59     0.0514139
16392   2014-10-26 03:29:59     0.0323344
16392   2014-10-26 03:29:59     0.0323344
16393   2014-10-26 03:29:59     12.3
16394   2014-10-26 03:44:59     0.0595618
16395   2014-10-26 03:59:59     0.0338677

then

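import pandas as pd

# Columns are separated by two or more spaces; a regex separator needs the Python engine.
df = pd.read_table('data', sep=r'\s{2,}', engine='python')
print(df)
X = df['kwh'].mean()
# Absolute distance of each kwh value from X.
df['dist'] = (df['kwh'] - X).abs()
# True for every row whose dist equals the minimum within its index group.
idx = df.groupby(['index'])['dist'].transform(lambda x: x == x.min()).astype(bool)
print(df.loc[idx])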

yields

                     index       kwh      dist
16391  2014-10-26 03:14:59  0.051414  2.033505
16392  2014-10-26 03:29:59  0.032334  2.052584
16392  2014-10-26 03:29:59  0.032334  2.052584
16394  2014-10-26 03:44:59  0.059562  2.025357
16395  2014-10-26 03:59:59  0.033868  2.051051

Note that by using transform here, we get a boolean mask which allows us to select all rows that have the minimum distance from X, including those with duplicate values of kwh.


You could use del df['dist'] to drop the dist column when you no longer need it.
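
If exactly one row per timestamp is wanted even when several rows tie for the minimum distance, one possible variation (not part of the answer above, just a sketch) is to order the rows by distance and keep the first row for each timestamp with drop_duplicates. The frame below is typed in by hand from the sample data, and keeping the first of the tied rows is an assumption about what is wanted:

```python
import pandas as pd

# Sample data from the answer, including the exact duplicate at 03:29:59.
df = pd.DataFrame(
    {
        "index": [
            "2014-10-26 03:14:59", "2014-10-26 03:29:59", "2014-10-26 03:29:59",
            "2014-10-26 03:29:59", "2014-10-26 03:44:59", "2014-10-26 03:59:59",
        ],
        "kwh": [0.0514139, 0.0323344, 0.0323344, 12.3, 0.0595618, 0.0338677],
    },
    index=[16391, 16392, 16392, 16393, 16394, 16395],
)

X = df["kwh"].mean()
dist = (df["kwh"] - X).abs()

# Order rows by distance; a stable sort keeps tied rows in their original order.
order = dist.to_numpy().argsort(kind="stable")
closest_first = df.iloc[order]

# Keep the closest row per timestamp, then restore the original row order.
deduped = closest_first.drop_duplicates(subset="index", keep="first").sort_index()
print(deduped)
```

Because the distances live in a separate Series, this version leaves df without a dist column.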
