How to remove rows by date based on quantile?

My problem is the following: I would like to remove the rows of a data frame whose value is lower than the 50th percentile computed for each date. The following example illustrates my problem.

I have the following data frame:

date <- c("01.02.2011","01.02.2011","01.02.2011","01.02.2011","01.02.2011","01.02.2011",
          "01.02.2011","01.02.2011","01.02.2011","01.02.2011",
          "02.02.2011","02.02.2011","02.02.2011","02.02.2011","02.02.2011","02.02.2011",
          "02.02.2011","02.02.2011","02.02.2011","02.02.2011")
date <- as.Date(date, format="%d.%m.%Y")
ID <- c("A","B","C","D","E","F","G","H","I","J",
        "A","B","C","D","E","F","G","H","I","J")
values <- c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
            4, 2, 12, 16, 8, 1, 7, 11, 2, 10)

df <- data.frame(ID, date, values)

It looks like this:

   ID       date values
1   A 2011-02-01      1
2   B 2011-02-01      8
3   C 2011-02-01      2
4   D 2011-02-01      3
5   E 2011-02-01      5
6   F 2011-02-01     13
7   G 2011-02-01      2
8   H 2011-02-01      4
9   I 2011-02-01      1
10  J 2011-02-01     16
11  A 2011-02-02      4
12  B 2011-02-02      2
13  C 2011-02-02     12
14  D 2011-02-02     16
15  E 2011-02-02      8
16  F 2011-02-02      1
17  G 2011-02-02      7
18  H 2011-02-02     11
19  I 2011-02-02      2
20  J 2011-02-02     10

For each date, I would like to delete all the rows whose values are below that date's 50th percentile, in order to obtain:

   ID       date values
2   B 2011-02-01      8
5   E 2011-02-01      5
6   F 2011-02-01     13
8   H 2011-02-01      4
10  J 2011-02-01     16
13  C 2011-02-02     12
14  D 2011-02-02     16
15  E 2011-02-02      8
18  H 2011-02-02     11
20  J 2011-02-02     10

If my question needs any editing, do not hesitate to let me know.

There are several ways to do that. Here are two solutions, though many more exist. They all apply the same idea: first compute the median by date, then filter your data.
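To make the two-step idea concrete without any package, here is a minimal base R sketch. It is not one of the solutions from the original answer: it assumes `ave()` is an acceptable way to repeat each date's median on every row of that date, and the names `med` and `res` are introduced only for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(
  ID     = rep(LETTERS[1:10], times = 2),
  date   = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

# Step 1: per-date median, repeated for every row of its date
# (med is a hypothetical helper name)
med <- ave(df$values, df$date, FUN = median)

# Step 2: keep only the rows strictly above their date's median
res <- df[df$values > med, ]
res
```

Note that a strict `>` also drops rows that are exactly equal to the median; use `>=` if those rows should be kept. In this example no value equals its date's median, so both comparisons give the same rows.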

data.table

If you want to use data.table, first add the per-date quantile to your data by reference using :=, then filter. data.table is a very efficient approach if your dataset is large.

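library(data.table)
setDT(df)  # convert df to a data.table in place

# add the per-date median as a helper column, by reference
df[, quant := quantile(values, probs = .5), by = "date"]
# keep rows strictly above their date's median, then drop the helper column
df2 <- df[values > quant]
df2[, 'quant' := NULL]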

df2
    ID       date values
 1:  B 2011-02-01      8
 2:  E 2011-02-01      5
 3:  F 2011-02-01     13
 4:  H 2011-02-01      4
 5:  J 2011-02-01     16
 6:  C 2011-02-02     12
 7:  D 2011-02-02     16
 8:  E 2011-02-02      8
 9:  H 2011-02-02     11
10:  J 2011-02-02     10
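If you would rather not add a helper column to the original table, the same filter can probably be written in one step with `.SD`. This is a sketch of an alternative, not the answer's code; it rebuilds the data as `dt` (a name chosen here for illustration) so that it runs on its own.

```r
library(data.table)

# Rebuild the example data from the question as a data.table
dt <- data.table(
  ID     = rep(LETTERS[1:10], times = 2),
  date   = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

# For each date, subset that date's rows (.SD) to those strictly above
# the date's median; dt itself is left unchanged
res <- dt[, .SD[values > quantile(values, probs = 0.5)], by = date]
res
```

The grouping column `date` comes first in the result. On large tables the `:=` approach may be faster, since `.SD` subsetting per group carries some overhead.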

dplyr

With dplyr, you pipe your operations: compute the quantile by group, then filter.

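library(dplyr)
df %>%
   group_by(date) %>%                        # one group per date
   mutate(quant = quantile(values, .5)) %>%  # per-date median as a helper column
   filter(values > quant) %>%                # keep rows strictly above it
   select(-quant)                            # drop the helper column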

# Groups:   date [2]
   ID    date       values
   <fct> <date>      <dbl>
 1 B     2011-02-01      8
 2 E     2011-02-01      5
 3 F     2011-02-01     13
 4 H     2011-02-01      4
 5 J     2011-02-01     16
 6 C     2011-02-02     12
 7 D     2011-02-02     16
 8 E     2011-02-02      8
 9 H     2011-02-02     11
10 J     2011-02-02     10
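The helper column is not strictly needed: on a grouped data frame, `filter()` evaluates its condition within each group, so the quantile can be computed inside the filter. The following is a sketch of that shorter variant, not the answer's code; the trailing `ungroup()` is an addition that returns an ungrouped tibble, and `res` is a name chosen here for illustration.

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(
  ID     = rep(LETTERS[1:10], times = 2),
  date   = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

res <- df %>%
  group_by(date) %>%
  # quantile() is evaluated once per date because the data is grouped
  filter(values > quantile(values, 0.5)) %>%
  ungroup()
res
```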
