
Filter a data frame based on the maximum value of a column

I have a data frame that has 10 columns and "n" rows (a lot of rows).

The idea of the data frame is simple: it holds the market expectation for the exchange rate for a date "t", as estimated at a time "d". For example, today we have the market expectation for the exchange rate at January/23, February/23, and so on (the data frame holds the expectations on a monthly basis, with daily updates).

This data frame has the historical estimates since January/2019, for dates up to December/2023.

So to summarize, we have a "date_of_estimate" column and an "estimation_reference" column.

The thing is, I want to filter this huge data frame to get the most recent value for each of the monthly estimates since 01-01-2019.

So the code should work just like a maxif function: it gets the highest value of the "date_of_estimate" column for each "estimation_reference" value. The "estimation_reference" can also be interpreted as a string, like "Group_A", "Group_B", etc.

How do I get the structure I'm looking for? I'm not very familiar with R, and this is an important work routine that has just fallen into my lap...

Thanks in advance.

My first guess was to use the aggregate function; the code I used was this:

`Cambio_PorDataRef = aggregate(base_cambio, by = list(base_cambio$Data), max)`

Where base_cambio is the raw data frame containing all the dates and estimates, and base_cambio$Data is the "date_of_estimate" column I mentioned above.

The result is: [image not available]

The "data_referencia" column should consist of unique values, with the "date_of_estimate" (the Group 1 column in the image) being the most recent one, i.e. the latest date available for that estimate. Instead it returns repeated values, and the values don't seem to make sense: it should begin at 01/2021 and progress month by month until 12/2023 (i.e. Dec/23).

By running dput(head(base_cambio, 20)) I got:

structure(list(Indicador = c("Câmbio", "Câmbio", "Câmbio", "Câmbio", 
"Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", 
"Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", 
"Câmbio", "Câmbio"), Data = structure(c(18655, 18654, 18653, 
18652, 18649, 18648, 18647, 18646, 18645, 18642, 18641, 18640, 
18639, 18638, 18635, 18634, 18633, 18632, 18631, 18683), class = "Date"), 
    DataReferencia = c("01/2021", "01/2021", "01/2021", "01/2021", 
    "01/2021", "01/2021", "01/2021", "01/2021", "01/2021", "01/2021", 
    "01/2021", "01/2021", "01/2021", "01/2021", "01/2021", "01/2021", 
    "01/2021", "01/2021", "01/2021", "02/2021"), Media = c(5.3, 
    5.29, 5.29, 5.29, 5.28, 5.25, 5.25, 5.25, 5.24, 5.24, 5.22, 
    5.21, 5.21, 5.19, 5.17, 5.14, 5.14, 5.13, 5.13, 5.38), Mediana = c(5.3, 
    5.3, 5.3, 5.3, 5.3, 5.25, 5.25, 5.25, 5.25, 5.25, 5.21, 5.2, 
    5.2, 5.16, 5.15, 5.15, 5.15, 5.14, 5.13, 5.4), DesvioPadrao = c(0.11, 
    0.11, 0.11, 0.11, 0.11, 0.1, 0.1, 0.1, 0.1, 0.11, 0.14, 0.14, 
    0.15, 0.15, 0.15, 0.14, 0.13, 0.13, 0.13, 0.07), Minimo = c(4.85, 
    4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 
    4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 5), Maximo = c(5.62, 
    5.62, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.49, 5.49, 
    5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.52), numeroRespondentes = c(102L, 
    102L, 100L, 99L, 99L, 95L, 97L, 97L, 97L, 98L, 92L, 92L, 
    90L, 89L, 90L, 91L, 90L, 90L, 89L, 107L), baseCalculo = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

The data frame looks like this:

[image: first 6 rows of the raw data]

The function is expected to look for the highest (latest) value in the second column (Data) for every unique occurrence in the third column (DataReferencia). As this sample has only one unique value, the entire first row should be the output of the code I'm looking for, as it has the highest value in the second column. The code should do the same for every unique value in the third column and gather the results in a new, filtered data frame with all the columns of the original data frame.

Output should be:

[image: expected output for the example]

You want dplyr::slice_max():

library(dplyr)

# For each reference month, keep the row(s) with the latest estimate date
base_cambio_recent <- base_cambio %>%
  group_by(DataReferencia) %>%
  slice_max(Data) %>%
  ungroup()
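One caveat: slice_max() keeps ties by default (with_ties = TRUE), so if a reference month ever has two rows sharing the latest Data, both come back. A minimal sketch of forcing exactly one row per group, using a small made-up data frame (the toy values are mine, not taken from base_cambio):

```r
library(dplyr)

# Hypothetical toy data: "01/2021" has two rows sharing the latest date.
toy <- tibble(
  DataReferencia = c("01/2021", "01/2021", "01/2021", "02/2021"),
  Data = as.Date(c("2021-01-27", "2021-01-28", "2021-01-28", "2021-02-25")),
  Media = c(5.29, 5.30, 5.31, 5.38)
)

# Default behaviour: ties are kept, so "01/2021" contributes two rows.
ties_kept <- toy %>%
  group_by(DataReferencia) %>%
  slice_max(Data) %>%
  ungroup()

# with_ties = FALSE returns exactly one row per group.
one_per_group <- toy %>%
  group_by(DataReferencia) %>%
  slice_max(Data, with_ties = FALSE) %>%
  ungroup()
```

If your data has only one row per estimate date and reference month, the two calls should give the same result.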

Or a base R approach:

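# Sort so the latest estimate date comes first
base_cambio_recent <- base_cambio[rev(order(base_cambio$Data)), ]
# Split by reference month and keep the first (latest) row of each group
# (the \(x) lambda shorthand needs R >= 4.1)
base_cambio_recent <- lapply(
  split(base_cambio_recent, base_cambio_recent$DataReferencia),
  \(x) head(x, 1)
)
# Stack the per-group rows back into one data frame
base_cambio_recent <- do.call(rbind, base_cambio_recent)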

Result from either approach:

# A tibble: 2 × 10
  Indicador Data       DataReferencia Media Mediana DesvioPadrao Minimo Maximo numeroRespondentes baseCalculo
  <chr>     <date>     <chr>          <dbl>   <dbl>        <dbl>  <dbl>  <dbl>              <int>       <int>
1 Câmbio    2021-01-28 01/2021         5.3      5.3         0.11   4.85   5.62                102           0
2 Câmbio    2021-02-25 02/2021         5.38     5.4         0.07   5      5.52                107           0
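As for why the original aggregate() call misbehaved: it grouped by Data (the estimate date) rather than DataReferencia, and it applied max to every column separately, so the values in one output row need not come from the same original row. Below is a sketch of an aggregate()-based variant that avoids both problems, shown on a small made-up data frame (the toy values and the RefDate helper column are my own, for illustration only). It also sorts the result chronologically, because "MM/YYYY" strings sort by month before year.

```r
# Hypothetical toy data with the same key columns as base_cambio.
toy <- data.frame(
  Data = as.Date(c("2021-01-27", "2021-01-28", "2021-02-25",
                   "2021-02-24", "2022-01-10")),
  DataReferencia = c("01/2021", "01/2021", "02/2021", "02/2021", "01/2022"),
  Media = c(5.29, 5.30, 5.38, 5.37, 5.60),
  stringsAsFactors = FALSE
)

# Step 1: latest estimate date per reference month (the "maxif" part).
latest <- aggregate(Data ~ DataReferencia, data = toy, FUN = max)

# Step 2: join back to the full data to keep only the matching rows,
# with all of their columns.
recent <- merge(toy, latest, by = c("DataReferencia", "Data"))

# Step 3: sort by reference month as a real date, not as text.
# RefDate is a helper column introduced here just for the sort.
recent$RefDate <- as.Date(paste0("01/", recent$DataReferencia),
                          format = "%d/%m/%Y")
recent <- recent[order(recent$RefDate), ]
```

With your data you would swap toy for base_cambio. Note that merge() moves the key columns to the front, so the column order differs from the original data frame.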
