Pandas - How can I set rules for selecting which duplicates to drop
I have a dataset with values in a column and datetime objects in the index. I want to drop the values that share the same index (date and time), but according to a rule such as:
I have two values for the same datetime, so I want to keep the one closer to a number X. That number could be, for example, the mean of the whole dataset.
The dataset looks like this (I reset the index while trying to make this work, because I got an error about having multiple indices):
index kwh
16391 2014-10-26 03:14:59 0.0514139
16392 2014-10-26 03:29:59 0.0323344
16393 2014-10-26 03:29:59 12.3
16394 2014-10-26 03:44:59 0.0595618
16395 2014-10-26 03:59:59 0.0338677
If X (for example, the mean value) is 0.05, then what I want to get back is:
16391 2014-10-26 03:14:59 0.0514139
16392 2014-10-26 03:29:59 0.0323344
16393 2014-10-26 03:44:59 0.0595618
16394 2014-10-26 03:59:59 0.0338677
I have tried using groupby and apply in several different ways, but I can't get it to work. Any help, please?
If you add a dist column to the DataFrame which measures the absolute distance between kwh and X:
X = df['kwh'].mean()
df['dist'] = (df['kwh'] - X).abs()
then you can group by index and build a boolean mask that marks the rows with the minimum dist in each group:
idx = df.groupby(['index'])['dist'].transform(lambda x: x == x.min()).astype(bool)
Then you can select those rows using df.loc:
df.loc[idx]
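The question mentions that the timestamps originally live in the index rather than in a column. As a rough sketch of the same idea for that layout (the Series name `s` and the inline sample data are assumptions made for illustration), one can group on the index level directly instead of resetting it:

```python
import pandas as pd

# Sample data from the question, with the timestamps as a (non-unique) index.
s = pd.Series(
    [0.0514139, 0.0323344, 12.3, 0.0595618, 0.0338677],
    index=pd.to_datetime([
        "2014-10-26 03:14:59",
        "2014-10-26 03:29:59",
        "2014-10-26 03:29:59",
        "2014-10-26 03:44:59",
        "2014-10-26 03:59:59",
    ]),
    name="kwh",
)

X = s.mean()
dist = (s - X).abs()

# For every row, look up the smallest distance within its timestamp group.
group_min = dist.groupby(level=0).transform("min")

# Keep only the rows that achieve that minimum; a plain NumPy mask
# sidesteps any label alignment on the duplicated index.
result = s[(dist == group_min).to_numpy()]
print(result)
```

With this data the 12.3 reading at 03:29:59 is dropped, since the other reading for that timestamp is closer to the mean.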
If data contains (note the duplicate values of kwh for the same index):
index kwh
16391 2014-10-26 03:14:59 0.0514139
16392 2014-10-26 03:29:59 0.0323344
16392 2014-10-26 03:29:59 0.0323344
16393 2014-10-26 03:29:59 12.3
16394 2014-10-26 03:44:59 0.0595618
16395 2014-10-26 03:59:59 0.0338677
then
import pandas as pd
df = pd.read_table('data', sep=r'\s{2,}', engine='python')  # raw string for the regex separator
print(df)
X = df['kwh'].mean()
df['dist'] = (df['kwh'] - X).abs()
idx = df.groupby(['index'])['dist'].transform(lambda x: x == x.min()).astype(bool)
print(df.loc[idx])
yields
index kwh dist
16391 2014-10-26 03:14:59 0.051414 2.033505
16392 2014-10-26 03:29:59 0.032334 2.052584
16392 2014-10-26 03:29:59 0.032334 2.052584
16394 2014-10-26 03:44:59 0.059562 2.025357
16395 2014-10-26 03:59:59 0.033868 2.051051
Note that by using transform here, we get a boolean mask which allows us to select all rows which have the minimum distance from X, including those with duplicate values of kwh.
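If exactly one row per timestamp is wanted even when several rows tie for the minimum distance, one possible variation (not part of the answer above, and using inline sample data as an assumption) is to sort by distance and keep the first row for each timestamp:

```python
import pandas as pd

# Same rows as the example data, built inline with a fresh 0..5 row index.
df = pd.DataFrame({
    "index": pd.to_datetime([
        "2014-10-26 03:14:59",
        "2014-10-26 03:29:59",
        "2014-10-26 03:29:59",
        "2014-10-26 03:29:59",
        "2014-10-26 03:44:59",
        "2014-10-26 03:59:59",
    ]),
    "kwh": [0.0514139, 0.0323344, 0.0323344, 12.3, 0.0595618, 0.0338677],
})

X = df["kwh"].mean()

closest = (
    df.assign(dist=(df["kwh"] - X).abs())
      .sort_values("dist", kind="mergesort")           # stable sort: ties keep their original order
      .drop_duplicates(subset="index", keep="first")   # first row per timestamp is the closest one
      .sort_index()                                    # restore the original row order
      .drop(columns="dist")
)
print(closest)
```

Here the two identical 0.0323344 readings collapse into one row, so the result has a single row per timestamp.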
You could use del df['dist'] to drop the dist column when you no longer need it.
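Alternatively, the helper column can be avoided altogether by keeping the distances in a separate Series. This is only a sketch of that variation, with the sample data built inline as an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "index": pd.to_datetime([
        "2014-10-26 03:14:59",
        "2014-10-26 03:29:59",
        "2014-10-26 03:29:59",
        "2014-10-26 03:29:59",
        "2014-10-26 03:44:59",
        "2014-10-26 03:59:59",
    ]),
    "kwh": [0.0514139, 0.0323344, 0.0323344, 12.3, 0.0595618, 0.0338677],
})

X = df["kwh"].mean()

# Distances live in their own Series, so df itself is never modified.
dist = (df["kwh"] - X).abs()

# Group the distances by the timestamp column and broadcast each group's minimum.
mask = dist == dist.groupby(df["index"]).transform("min")

result = df[mask]
print(result)
```

The selected rows are the same as with the dist column approach, ties included, but df keeps only its original columns.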