R: Last non-NA value among column sets

I am looking for a solution to the problem below that works inside a pipe.

I have data that looks like this:

df_foo <- tibble(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA), column_set_1_3 = c(3, NA, NA),
  column_set_2_1 = c(1, 2, 3), column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9), 
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA), column_set_2_6 = c(NA, NA, NA)
)

# A tibble: 3 × 9
  column_set_1_1 column_set_1_2 column_set_1_3 column_set_2_1 column_set_2_2 column_set_2_3 column_set_2_4 column_set_2_5 column_set_2_6
           <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <lgl>
1              1              2              3              1              4              7             10             13             NA
2              2              3             NA              2              5              8             11             NA             NA
3              3             NA             NA              3              6              9             NA             NA             NA

I want to get the last non-NA value in each row, separately for each column set. So the expected output is:

tibble(
  column_set_1 = c(3, 3, 3), 
  column_set_2 = c(13, 11, 9)
)

# A tibble: 3 × 2
  column_set_1 column_set_2
         <dbl>        <dbl>
1            3           13
2            3           11
3            3            9

Here is a tidyverse approach that does not reshape the original data frame. Instead, it splits the data frame into groups by column-name pattern and applies coalesce to the reversed columns of each sub-data-frame to get the last non-NA value in each row:

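library(tidyverse)

df_foo %>% 
      # make every column numeric (column_set_2_6 is logical because it is all NA)
      mutate_all(as.numeric) %>% 
      # split the columns into one sub-data-frame per column set
      split.default(f = sub("_\\d+$", "", names(.))) %>% 
      # reverse each set's columns, then take the first non-NA value per row
      map_df(~ do.call(coalesce, unname(as.list(rev(.x)))))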

# A tibble: 3 × 2
#  column_set_1 column_set_2
#         <dbl>        <dbl>
#1            3           13
#2            3           11
#3            3            9
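To see why reversing the columns works, here is a minimal sketch on one column set from the question, written out as plain vectors (the names x1, x2, x3 and last_non_na are only for illustration). coalesce() returns, element by element, the first non-missing value among its arguments, so passing the columns last-to-first gives the last non-missing value of each row.

```r
library(dplyr)

# Columns column_set_1_1 .. column_set_1_3 from the question, as plain vectors
x1 <- c(1, 2, 3)
x2 <- c(2, 3, NA)
x3 <- c(3, NA, NA)

# First non-NA value when scanning x3, then x2, then x1,
# i.e. the last non-NA value in the original column order
last_non_na <- coalesce(x3, x2, x1)
last_non_na
```

The do.call() in the answer above does the same thing, just without having to type out the column names by hand.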

Here is my solution using tidyverse tools:

library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

get_last_nonNA <- function(vec) {
  return(last(vec[!is.na(vec)]))
}

convert_table_last_nonNA <- . %>%
  rownames_to_column() %>%
  gather(key=column_type, value=value, -rowname) %>%
  mutate(column_set=str_extract(string=column_type,
                                pattern="[0-9]+")) %>%
  group_by(column_set, rowname) %>%
  summarise(last_nonNA_value=get_last_nonNA(value)) %>%
  spread(key=column_set, value=last_nonNA_value) %>%
  select(-rowname) %>%
  select(colnames(.) %>% as.integer() %>% order()) %>%
  "colnames<-"(paste0("column_set_", colnames(.)))
# Usage
data_tbl <- tibble(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA),
  column_set_1_3 = c(3, NA, NA), column_set_2_1 = c(1, 2, 3),
  column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9), 
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA),
  column_set_2_6 = c(NA, NA, NA)
)

convert_table_last_nonNA(data_tbl)

# # A tibble: 3 × 2
#   column_set_1 column_set_2
# *        <dbl>        <dbl>
# 1            3           13
# 2            3           11
# 3            3            9

What it does, step by step:

  1. Creates a reusable pipe (a magrittr functional sequence) with convert_table_last_nonNA <- . %>% ...;
  2. Adds the row names as a separate column with rownames_to_column(), so that each value can later be traced back to its original row;
  3. Transforms the input table into long format with gather(key=column_type, value=value, -rowname): each row now represents a combination of the key columns (rowname and column_type) and a value (value);
  4. Computes the column set number with a regular expression (it extracts the first number from the column_type strings) and stores it in a separate column, column_set. This is done with mutate(column_set=str_extract(string=column_type, pattern="[0-9]+"));
  5. Summarises the data with group_by(column_set, rowname) %>% summarise(last_nonNA_value=get_last_nonNA(value)). That is, for every combination of column_set and rowname it takes the last non-NA value of value (via the get_last_nonNA call) and stores it in the column last_nonNA_value. Note: if there are only NAs for some combination of column_set and rowname, the result will be NA;
  6. Transforms the table back into wide format with spread(key=column_set, value=last_nonNA_value). Now there is one column for every value of column_set, holding the corresponding last_nonNA_value entries;
  7. Deletes the column rowname because it is no longer needed;
  8. Reorders the columns by increasing column set number. This is needed because with more than 9 column sets in the original data the columns would otherwise be sorted as text (that is, column_set_10 would be placed directly after column_set_1). This is done with select(colnames(.) %>% as.integer() %>% order());
  9. Adds the prefix column_set_ to the column names with "colnames<-"(paste0("column_set_", colnames(.))).
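The same steps can also be sketched with pivot_longer() and pivot_wider(), which have superseded gather() and spread() in tidyr. This is only a sketch, assuming every column name has the form column_set_<set>_<position>; the helper column names row, position and last_value are made up for this example and are not part of the answer above.

```r
library(dplyr)
library(tidyr)

data_tbl <- tibble(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA),
  column_set_1_3 = c(3, NA, NA), column_set_2_1 = c(1, 2, 3),
  column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA),
  column_set_2_6 = c(NA, NA, NA)
)

result <- data_tbl %>%
  # keep track of the original row (step 2)
  mutate(row = row_number()) %>%
  # long format, extracting set and position as integers (steps 3 and 4)
  pivot_longer(
    cols = -row,
    names_to = c("column_set", "position"),
    names_pattern = "^column_set_([0-9]+)_([0-9]+)$",
    names_transform = list(column_set = as.integer, position = as.integer)
  ) %>%
  # last non-NA value per row and column set (step 5);
  # values stay in column order within each group
  group_by(row, column_set) %>%
  summarise(last_value = last(value[!is.na(value)]), .groups = "drop") %>%
  # back to wide format, re-adding the prefix (steps 6 and 9)
  pivot_wider(names_from = column_set, values_from = last_value,
              names_prefix = "column_set_") %>%
  # drop the helper column (step 7)
  select(-row)

result
```

Because column_set is an integer here, the summarised rows are sorted numerically, so the separate reordering in step 8 should not be needed in this variant.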

Here is a solution that I came up with that works with pipes:

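library(tidyverse)

df_foo %>% 
  # assumption: the sample data above has no ID column, so add a row identifier
  rowid_to_column("ID") %>% 
  gather(key = Key, value = Value, -ID) %>% 
  mutate(set = str_extract(Key, "column_set_[0-9]+")) %>% 
  # position within the set, as an integer so that 10 sorts after 9
  mutate(number = as.integer(str_extract(Key, "[0-9]+$"))) %>% 
  group_by(ID, set) %>% 
  dplyr::filter(!is.na(Value)) %>%
  arrange(number) %>% 
  slice(n()) %>% 
  ungroup() %>% 
  select(-number, -Key) %>% 
  spread(key = set, value = Value)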

I don't like having to arrange and then slice out the last row; it seems inelegant to me. Any improvements welcome.
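One possible way to drop the arrange and slice pair, as a sketch rather than a definitive answer: after filtering out the NAs, summarise each group with last(). This assumes the columns of each set already appear in the data frame in the intended order, since gather() stacks the columns in the order they occur. The ID column is added with rowid_to_column() because the sample data has none.

```r
library(dplyr)
library(tidyr)
library(tibble)

df_foo <- tibble(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA), column_set_1_3 = c(3, NA, NA),
  column_set_2_1 = c(1, 2, 3), column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA), column_set_2_6 = c(NA, NA, NA)
)

result <- df_foo %>%
  rowid_to_column("ID") %>%                 # hypothetical row identifier
  gather(key = Key, value = Value, -ID) %>%
  mutate(set = sub("_[0-9]+$", "", Key)) %>% # strip the trailing position number
  filter(!is.na(Value)) %>%
  group_by(ID, set) %>%
  # rows within a group are still in column order, so last() is enough
  summarise(Value = last(Value), .groups = "drop") %>%
  spread(key = set, value = Value)

result
```

Unlike the expected output, the result keeps the ID column; append select(-ID) if it is not wanted.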
