R: Last non-NA value among column sets
I am looking for a solution to the problem below that works in pipes.
I have data that looks like this:
df_foo <- tibble(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA), column_set_1_3 = c(3, NA, NA),
  column_set_2_1 = c(1, 2, 3), column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA), column_set_2_6 = c(NA, NA, NA)
)
df_foo
# A tibble: 3 × 9
  column_set_1_1 column_set_1_2 column_set_1_3 column_set_2_1 column_set_2_2 column_set_2_3 column_set_2_4 column_set_2_5 column_set_2_6
           <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl> <lgl>
1              1              2              3              1              4              7             10             13 NA
2              2              3             NA              2              5              8             11             NA NA
3              3             NA             NA              3              6              9             NA             NA NA
I am basically looking to get the last non-NA value in each column set, so the expected output is:
tibble(
column_set_1 = c(3, 3, 3),
column_set_2 = c(13, 11, 9)
)
# A tibble: 3 × 2
  column_set_1 column_set_2
         <dbl>        <dbl>
1            3           13
2            3           11
3            3            9
Here is a tidyverse approach that does not reshape the original data frame. It splits the columns into groups by column-name pattern, then uses the coalesce function on the reversed columns of each sub data frame to get the last non-NA value per row:
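library(tidyverse)

df_foo %>%
  # make every column numeric so the all-NA logical column can be coalesced
  mutate_all(as.numeric) %>%
  # group the columns by name with the trailing "_<digits>" stripped
  split.default(f = sub("_\\d+$", "", names(.))) %>%
  # reverse each group's columns so coalesce picks the last non-NA value
  map_df(~do.call(coalesce, setNames(rev(.), NULL)))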
# A tibble: 3 × 2
# column_set_1 column_set_2
# <dbl> <dbl>
#1 3 13
#2 3 11
#3 3 9
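The answer above works because coalesce returns the first non-missing value at each position, so reversing the column order turns "first" into "last". The same split-by-name idea can be sketched in base R without any packages. This is only an illustration under stated assumptions: last_non_na is a hypothetical helper name, and the data is rebuilt as a plain data.frame.

```r
# Same example data as in the question, as a base R data.frame.
df_foo <- data.frame(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA), column_set_1_3 = c(3, NA, NA),
  column_set_2_1 = c(1, 2, 3), column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA), column_set_2_6 = c(NA, NA, NA)
)

# Hypothetical helper: last non-NA element of a vector, or NA if there is none.
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else x[[length(x)]]
}

# Group the columns by their name with the trailing "_<digits>" removed.
groups <- split.default(df_foo, sub("_\\d+$", "", names(df_foo)))

# For each group of columns, apply the helper across the rows.
result <- as.data.frame(
  lapply(groups, function(g) apply(as.matrix(g), 1, last_non_na))
)
result
```

The row-wise apply is easy to read but not vectorised, so it may be slow on large tables.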
Here is my solution using tidyverse tools:
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

get_last_nonNA <- function(vec) {
  return(last(vec[!is.na(vec)]))
}

convert_table_last_nonNA <- . %>%
  rownames_to_column() %>%
  gather(key=column_type, value=value, -rowname) %>%
  mutate(column_set=str_extract(string=column_type,
                                pattern="[0-9]+")) %>%
  group_by(column_set, rowname) %>%
  summarise(last_nonNA_value=get_last_nonNA(value)) %>%
  spread(key=column_set, value=last_nonNA_value) %>%
  select(-rowname) %>%
  select(colnames(.) %>% as.integer() %>% order()) %>%
  "colnames<-"(paste0("column_set_", colnames(.)))
# Usage
data_tbl <- tibble(
column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA),
column_set_1_3 = c(3, NA, NA), column_set_2_1 = c(1, 2, 3),
column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA),
column_set_2_6 = c(NA, NA, NA)
)
convert_table_last_nonNA(data_tbl)
# # A tibble: 3 × 2
# column_set_1 column_set_2
# * <dbl> <dbl>
# 1 3 13
# 2 3 11
# 3 3 9
What it does, step by step:
convert_table_last_nonNA <- . %>% creates a reusable pipe;
rownames_to_column() adds the row names as a separate column, so that the last non-NA value can later be extracted per row;
gather(key=column_type, value=value, -rowname) converts the input table to long format: each row now represents a combination of the key columns (rowname and column_type) and a value (value);
mutate(column_set=str_extract(string=column_type, pattern="[0-9]+")) extracts the set identifier (the first number in the column_type strings) and stores it in a separate column, column_set;
group_by(column_set, rowname) %>% summarise(last_nonNA_value=get_last_nonNA(value)) summarises the data in the desired way: for every combination of column_set and rowname, it takes the last non-NA value of value (via the get_last_nonNA call) and stores it in the column last_nonNA_value. Note: if a combination of column_set and rowname contains only NAs, the result is NA;
spread(key=column_set, value=last_nonNA_value) converts the table back to wide format, so there is now a column for every item in column_set, holding the last_nonNA_value entries;
select(-rowname) removes the rowname column because it is no longer needed;
select(colnames(.) %>% as.integer() %>% order()) orders the columns by the numeric value of the set identifier (otherwise column_set_10 would be placed directly after column_set_1);
"colnames<-"(paste0("column_set_", colnames(.))) adds the prefix column_set_ to the column names.
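In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The steps described above could be sketched with the newer functions roughly as follows. This is a minimal sketch, not the answer author's code: the column names row_id and position are illustrative assumptions, and it expects every column name to end in an underscore followed by digits.

```r
library(dplyr)
library(tidyr)

data_tbl <- tibble(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA),
  column_set_1_3 = c(3, NA, NA), column_set_2_1 = c(1, 2, 3),
  column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA),
  column_set_2_6 = c(NA, NA, NA)
)

result <- data_tbl %>%
  # keep track of the original row (row_id is an assumed helper column)
  mutate(row_id = row_number()) %>%
  # long format: split each name into the set name and the trailing position
  pivot_longer(
    cols = -row_id,
    names_to = c("column_set", "position"),
    names_pattern = "^(.*)_(\\d+)$",
    values_to = "value"
  ) %>%
  # last non-NA value per row and column set (NA if the set is all NA)
  group_by(row_id, column_set) %>%
  summarise(value = last(value[!is.na(value)]), .groups = "drop") %>%
  # back to wide format: one column per set
  pivot_wider(names_from = column_set, values_from = value) %>%
  select(-row_id)

result
```

Because row_id is an integer, the row order is preserved for tables with ten or more rows. The set names are still sorted as strings, so with ten or more column sets the output columns may need to be reordered, as in the select step described above.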
Here is a solution that I came up with that works with pipes:
# Assumes df_foo also has a row identifier column named ID.
df_foo %>%
  gather(key = Key, value = Value, -ID) %>%
  mutate(set = str_extract(Key, "column_set_[0-9]")) %>%
  # convert to integer so positions sort numerically (10 after 9, not after 1)
  mutate(number = as.integer(str_extract(Key, "(?<=column_set_[0-9]_)[0-9]+"))) %>%
  group_by(ID, set) %>%
  dplyr::filter(!is.na(Value)) %>%
  arrange(number) %>%
  slice(n()) %>%
  select(-number, -Key) %>%
  spread(key = set, value = Value)
I don't like the fact that I have to arrange and then slice out the last row; it seems inelegant to me. Any improvements are welcome.
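One possible way to avoid the arrange-then-slice step altogether is to locate the last non-NA column of each row directly. The sketch below does this in base R with max.col on the matrix of non-missing flags. It is an alternative technique rather than a refinement of the pipeline above, last_non_na_by_row is a hypothetical helper name, and the data is the example from the question without an ID column.

```r
df_foo <- data.frame(
  column_set_1_1 = c(1, 2, 3), column_set_1_2 = c(2, 3, NA), column_set_1_3 = c(3, NA, NA),
  column_set_2_1 = c(1, 2, 3), column_set_2_2 = c(4, 5, 6), column_set_2_3 = c(7, 8, 9),
  column_set_2_4 = c(10, 11, NA), column_set_2_5 = c(13, NA, NA), column_set_2_6 = c(NA, NA, NA)
)

# Hypothetical helper: last non-NA value in each row of a data frame.
last_non_na_by_row <- function(d) {
  m <- as.matrix(d)
  # Column index of the last non-NA entry per row. A row with no non-NA
  # entry resolves to the last column, whose value is NA anyway.
  idx <- max.col(!is.na(m), ties.method = "last")
  # Pick one cell per row using (row, column) index pairs.
  m[cbind(seq_len(nrow(m)), idx)]
}

# Group the columns by set name, then reduce each group to a single column.
sets <- sub("_\\d+$", "", names(df_foo))
result <- as.data.frame(lapply(split.default(df_foo, sets), last_non_na_by_row))
result
```

This relies on the columns of each set already being in positional order in the data frame, so no sorting is needed.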