Get the latest non-NA value based on date column by group

I have a dataframe with country_name, date and several value columns: column_1, column_2 and column_3. For each of these columns, I am trying to extract the latest non-NA value based on date.

The dataframe looks like this:

| country_name | date        | column_1| column_2| column_3|
| US           | 2016-11-02  | 7.5     | NA      | NA      |
| US           | 2017-09-12  | NA      | NA      | 9       |
| US           | 2017-09-19  | NA      | 8       | 10      |
| US           | 2020-02-10  | 10      | NA      | NA      |
| US           | 2021-03-10  | NA      | NA      | 7.3     |
| US           | 2021-05-02  | NA      | 3       | NA      |
| UK           | 2016-11-02  | NA      | 2       | NA      |
| UK           | 2017-09-12  | 0.5     | 3       | NA      |
 .
 .

For the US, the desired output is:

| country_name | column_1| column_2| column_3|
| US           | 10      | 3       | 7.3     |

For column_1, the value with the latest date is 10 (date: 2020-02-10), for column_2 it is 3 (date: 2021-05-02), and for column_3 it is 7.3 (date: 2021-03-10). My goal is to apply this logic across several countries. How do I achieve this?

library(dplyr)
library(tidyr)

df1 %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(country_name) %>%
  arrange(date) %>%
  select(-date) %>% 
  fill(everything()) %>% 
  slice(n())

#> # A tibble: 2 x 4
#> # Groups:   country_name [2]
#>   country_name column_1 column_2 column_3
#>   <chr>           <dbl>    <int>    <dbl>
#> 1 UK                0.5        3     NA  
#> 2 US               10          3      7.3

Data:

read.table(text = "country_name  date         column_1 column_2 column_3
                   US            2016-11-02   7.5      NA       NA      
                   US            2017-09-12   NA       NA       9       
                   US            2017-09-19   NA       8        10      
                   US            2020-02-10   10       NA       NA      
                   US            2021-03-10   NA       NA       7.3     
                   US            2021-05-02   NA       3        NA      
                   UK            2016-11-02   NA       2        NA      
                   UK            2017-09-12   0.5      3        NA", 
           header = TRUE, stringsAsFactors = FALSE) -> df1

Update:

Thanks to @Darren Tsai for handling the warning:

Warning: Problem while computing `..1 = across(-country_name, ~parse_number(.))`.
i 1 parsing failure. row col expected actual 1 -- a number NA NA

The warning appears when a column is entirely NA within a group (here column_3 for the UK), so the collapsed string contains no number. Adding this line of code strips the NA text before parsing:

  mutate(across(-country_name, ~str_trim(str_replace_all(., 'NA', ''))))

library(tidyverse)
library(lubridate)

df1 %>% 
  mutate(date = ymd(date)) %>% 
  group_by(country_name) %>%
  arrange(date, .by_group = TRUE) %>% 
  summarise(across(starts_with("column"), ~paste(rev(.), collapse = ' '))) %>% 
  mutate(across(-country_name, ~str_trim(str_replace_all(., 'NA', '')))) %>% 
  mutate(across(-country_name, ~parse_number(.)))

#>   country_name column_1 column_2 column_3
#>   <chr>           <dbl>    <dbl>    <dbl>
#> 1 UK                0.5        3     NA  
#> 2 US               10          3      7.3

First answer:

Here is how we could do it:

  1. If necessary, convert the date column to Date class with ymd() from lubridate.
  2. Group by country_name.
  3. Now comes the trick: use across() on column_1, column_2, etc. and collapse each column in reverse order with paste(rev(.), collapse = ' '), so that the latest value comes first. This is important for the next step.
  4. Use parse_number() from the readr package, which extracts the first number from the string.

library(dplyr)
library(lubridate)
library(readr)

df %>% 
  mutate(date = ymd(date)) %>% 
  group_by(country_name) %>%
  arrange(date, .by_group = TRUE) %>% 
  summarise(across(starts_with("column"), ~paste(rev(.), collapse = ' '))) %>% 
  mutate(across(-country_name, parse_number))

 country_name column_1 column_2 column_3
  <chr>           <dbl>    <dbl>    <dbl>
1 UK                0.5        3     NA  
2 US               10          3      7.3
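
To see why the reverse-and-collapse trick works, here is a minimal sketch on a single column (the US values of column_1, already in date order). The variable names are illustrative only, and the readr call is guarded in case the package is not installed.

```r
# US values of column_1, sorted by date (oldest first)
x <- c(7.5, NA, NA, 10, NA, NA)

# Reverse so the newest entry comes first, then collapse into one string
collapsed <- paste(rev(x), collapse = " ")
print(collapsed)

# parse_number() picks up the first number in the string,
# i.e. the latest non-NA value
if (requireNamespace("readr", quietly = TRUE)) {
  print(readr::parse_number(collapsed))
}

# A column that is all NA within a group collapses to a string with no number,
# which is what triggers the parsing-failure warning mentioned in the update
all_na <- paste(rev(c(NA, NA)), collapse = " ")
print(all_na)
```

Because the approach goes through character strings, it relies on the values printing as plain numbers; the other answers work on the numeric values directly.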

You could na.omit() and rev()erse each column and take the first element with el(). Then rbind() the per-country results. Make sure the rows are in the right order() and that date is formatted with as.Date().

by(transform(dat, date=as.Date(date)), dat$country_name, \(x) {
  cbind(x[1, 1, drop=FALSE], 
        lapply(x[order(x$date), 3:5], \(z) {
          z <- el(rev(na.omit(z)))
          ifelse(length(z) == 1, z, NA_real_)
        }))
}) |> c(make.row.names=FALSE) |> do.call(what=rbind)
#   country_name column_1 column_2 column_3
# 1           UK      0.5        3       NA
# 2           US     10.0        3      7.3
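
The per-column step can be tried in isolation. The sketch below uses a hypothetical helper, last_non_na(), which is not part of the answer above; it assumes the vector is already sorted by date and uses plain indexing instead of el().

```r
# Hypothetical helper: latest non-NA value of a vector sorted oldest-to-newest
last_non_na <- function(z) {
  z <- rev(na.omit(z))                 # drop NAs, newest value first
  if (length(z) > 0) z[1] else NA_real_
}

us_column_1 <- c(7.5, NA, NA, 10, NA, NA)
uk_column_3 <- c(NA_real_, NA_real_)

print(last_non_na(us_column_1))  # latest non-NA value for the US
print(last_non_na(uk_column_3))  # all NA, so NA is returned
```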

Data:

dat <- structure(list(country_name = c("US", "US", "US", "US", "US", 
"US", "UK", "UK"), date = c("2016-11-02", "2017-09-12", "2017-09-19", 
"2020-02-10", "2021-03-10", "2021-05-02", "2016-11-02", "2017-09-12"
), column_1 = c(7.5, NA, NA, 10, NA, NA, NA, 0.5), column_2 = c(NA, 
NA, 8L, NA, NA, 3L, 2L, 3L), column_3 = c(NA, 9, 10, NA, 7.3, 
NA, NA, NA)), class = "data.frame", row.names = c(NA, -8L))

You can summarise each country across multiple columns with across(). The latest non-NA value can be subsetted by .x[date == max(date[!is.na(.x)])].

library(dplyr)

df %>%
  group_by(country_name) %>%
  summarise(across(starts_with("column"),
                   ~ if(all(is.na(.x))) NA else .x[date == max(date[!is.na(.x)])])) %>%
  ungroup()

# # A tibble: 2 × 4
#   country_name column_1 column_2 column_3
#   <chr>           <dbl>    <int>    <dbl>
# 1 UK                0.5        3     NA  
# 2 US               10          3      7.3
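
As a rough illustration of the subsetting expression outside of summarise(), the sketch below applies it to the US rows of column_2 using plain vectors. The variable names are chosen for the example only.

```r
# US rows: dates and the matching column_2 values
date <- as.Date(c("2016-11-02", "2017-09-12", "2017-09-19",
                  "2020-02-10", "2021-03-10", "2021-05-02"))
column_2 <- c(NA, NA, 8, NA, NA, 3)

# Latest date among the rows where column_2 is not NA
latest_date <- max(date[!is.na(column_2)])

# Value of column_2 on that date
latest_value <- column_2[date == latest_date]

print(latest_date)
print(latest_value)
```

Note that if a country had two rows sharing the same latest date, this subset would return more than one value, so the approach assumes dates are unique within each group.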

Another idea:

df %>% 
  group_by(country_name) %>%
  arrange(desc(date), .by_group = TRUE) %>% 
  summarise(across(starts_with("column"), ~ .x[!is.na(.x)][1])) %>% 
  ungroup()
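
The expression .x[!is.na(.x)][1] can be checked on a single vector. This is a small sketch using the US values of column_3 in descending date order; the variable names are illustrative.

```r
# US values of column_3, newest first (as after arrange(desc(date)))
x_desc <- c(NA, 7.3, NA, 10, 9, NA)

# Drop the NAs and take the first remaining element
latest <- x_desc[!is.na(x_desc)][1]
print(latest)

# When every value is NA, the filtered vector is empty and [1] yields NA
all_na <- c(NA_real_, NA_real_)
latest_all_na <- all_na[!is.na(all_na)][1]
print(latest_all_na)
```

Indexing an empty vector with [1] returns NA rather than an error, which is why this version needs no explicit check for all-NA groups.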

Here is a base R solution. It uses two sapply() calls: one over countries and one over columns.

foo <- structure(list(country_name = c("US", "US", "US", "US", "US", 
"US", "UK", "UK"), date = c("2016-11-02", "2017-09-12", "2017-09-19", 
"2020-02-10", "2021-03-10", "2021-05-02", "2016-11-02", "2017-09-12"
), column_1 = c(7.5, NA, NA, 10, NA, NA, NA, 0.5), column_2 = c(NA, 
NA, 8L, NA, NA, 3L, 2L, 3L), column_3 = c(NA, 9, 10, NA, 7.3, 
NA, NA, NA)), class = "data.frame", row.names = c(NA, -8L))


split(foo, foo$country_name)|>
  sapply( function(s) {
    s = s[order(s$date),]
    sapply(s[,3:5], function(x) {
       y = na.omit(x)
       ifelse(length(y)> 0, y[length(y)], NA) })}) |>
  t()

#   column_1 column_2 column_3
#UK      0.5        3       NA
#US     10.0        3      7.3
