Count repeated rows and the first non-NA appearance in a data frame

I have the following sample dataset:

library(tidyverse)
dataset <- data.frame(id = c("A","A","B","B","C","A","C","B"), 
                  value = c(100, 500, 200, 100, 500, 300, 400, 100), 
                  status = c(NA, "Valid", NA, NA, "Pend", NA, NA, NA), 
                  stringsAsFactors = FALSE)

What I need is one row per unique id, holding its highest value, the number of times the id repeats, and its first non-NA status.

I have solved it this way:

dataset_count <- dataset %>%
  group_by(id) %>%
  summarise(count = n(), comment = max(status, na.rm = TRUE)) %>%
  ungroup()

dataset_cross <- dataset %>%
  arrange(desc(value)) %>%
  left_join(dataset_count, by = "id") %>%
  distinct(id, .keep_all = TRUE)

However, my original dataset has 120 variables and more rules to follow, so I would like to know whether there is a more compact way to do this. For example, I read about coalesce, but it doesn't let me extract the first non-NA value within grouped data. Could you please give me some advice? Thank you.

For each id, you could get the maximum value using max, count the number of rows using n(), and pick the first non-NA status with which.max.

library(dplyr)

dataset %>%
  group_by(id) %>%
  summarise(value = max(value), 
            count = n(), 
            status = status[which.max(!is.na(status))])

#  id    value count status
#  <chr> <dbl> <int> <chr> 
#1 A       500     3 Valid 
#2 B       200     3 NA    
#3 C       500     2 Pend  
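
To see why status[which.max(!is.na(status))] picks the first non-NA value, it may help to run the expression on plain vectors. The sketch below uses only base R; the helper name first_non_na and the two example vectors are assumptions made for illustration and are not part of the answer above.

```r
# Hypothetical helper (name chosen for this sketch): first non-NA element of a vector.
first_non_na <- function(x) x[which.max(!is.na(x))]

status_a <- c(NA, "Valid", NA)                            # statuses of id "A"
status_b <- c(NA_character_, NA_character_, NA_character_) # statuses of id "B"

!is.na(status_a)             # FALSE  TRUE FALSE
which.max(!is.na(status_a))  # 2: position of the first TRUE
first_non_na(status_a)       # "Valid"

# With no TRUE at all, which.max() falls back to position 1,
# so the result is simply the first element, which is NA.
which.max(!is.na(status_b))  # 1
first_non_na(status_b)       # NA
```

This fallback is the reason id B ends up with NA in the output instead of raising an error.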

Here is a base R solution:

dfout <- do.call(rbind,
                 c(make.row.names = FALSE,
                   lapply(split(dataset, dataset$id),
                          function(v) {
                            # build one summary row per id
                            data.frame(
                              id = v$id[1],
                              value = max(v$value),
                              count = nrow(v),
                              # position of the first non-NA status (1 if all are NA)
                              status = v$status[which.max(!is.na(v$status))]
                            )
                          })))

which gives:

> dfout
  id value count status
1  A   500     3  Valid
2  B   200     3   <NA>
3  C   500     2   Pend
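
Since the question mentions about 120 variables, the same "first non-NA" rule could be applied to several columns at once. The following is only a sketch in base R under stated assumptions: the extra column note, the helper first_non_na, and the vector cols are hypothetical names added for illustration, and aggregate() plus merge() is just one of several ways to combine the pieces.

```r
# Sample data from the question plus a hypothetical extra column `note`.
dataset <- data.frame(id = c("A","A","B","B","C","A","C","B"),
                      value = c(100, 500, 200, 100, 500, 300, 400, 100),
                      status = c(NA, "Valid", NA, NA, "Pend", NA, NA, NA),
                      note = c(NA, NA, "x", NA, NA, "y", "z", NA),
                      stringsAsFactors = FALSE)

# Hypothetical helper: first non-NA element of a vector (NA if there is none).
first_non_na <- function(x) x[which.max(!is.na(x))]

cols  <- c("status", "note")      # columns that follow the "first non-NA" rule
by_id <- list(id = dataset$id)    # grouping variable

# One small summary table per rule, all keyed by id.
firsts <- aggregate(dataset[cols], by = by_id, FUN = first_non_na)
maxes  <- aggregate(dataset["value"], by = by_id, FUN = max)
counts <- aggregate(data.frame(count = dataset$value), by = by_id, FUN = length)

# Join the summaries on id into one row per id.
result <- Reduce(function(a, b) merge(a, b, by = "id"),
                 list(maxes, counts, firsts))
result
```

Adding a column to cols is enough to include it in the summary. As before, an id whose values are all NA in a column (id B for status) keeps NA there.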
