
How to keep the values (dropping NAs) by groups?

My data has three proportion variables by geography and year. I am trying to aggregate the data by dropping the NAs and collating the values of the three variables by year and geography.

The example data frame is as follows:

df <- data.frame(FIPS = c("01001", "01001", "01001","01001", "01001", "01001", "01003", "01003", "01003", "01003", "01003", "01003"),
                 Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2000, 2000, 2000, 2001, 2001, 2001),
                 prop1 = c(0.7, NA, NA, 0.5, NA, NA, 0.3, NA, NA, 0.5, NA, NA),
                 prop2 = c(NA, 0.3, NA, NA, 0.5, NA, NA, 0.3, NA, NA, 0.1, NA),
                 prop3 = c(NA, NA, NA, NA, NA, NA, NA, NA, 0.4, NA, NA, 0.4))

I am guessing this can be done with aggregate or distinct in R, but I am not sure exactly how to proceed, as neither of the attempts below gives me the data frame I want.

df2 = aggregate(df,by = list(df$FIPS, df$Year), FUN = ???)

df2 <- df %>% distinct(FIPS, Year, .keep_all = TRUE)

The expected data frame is as follows:

df2 <- data.frame(FIPS = c("01001", "01001",  "01003", "01003" ),
                  Year = c(2000,  2001,  2000,  2001),
                  prop1 = c(0.7,  0.5, 0.3, 0.5 ),
                  prop2 = c(0.3, 0.5,  0.3, 0.1),
                  prop3 = c(NA,  NA, 0.4, 0.4))

So basically, I want the code to look up the existing proportions (or NA if missing) in the 'prop' variables by Year and FIPS, and create a new data frame with unique FIPS and Year combinations and the proportions collated. If anyone can point out the errors in what I am trying, or give me another solution, it will be very much appreciated!

You could use dplyr for this:

library(dplyr)
df %>%
  group_by(FIPS, Year) %>%
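  # mean() of an all-NA group with na.rm = TRUE returns NaN;
  # the replace() step below turns those NaN values back into NA
  summarise_at(vars(prop1:prop3), mean, na.rm = TRUE) %>%
  replace(is.na(.), NA)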
# A tibble: 4 x 5
# Groups:   FIPS [?]
  FIPS   Year prop1 prop2 prop3
  <fct> <dbl> <dbl> <dbl> <dbl>
1 01001  2000   0.7   0.3  NA  
2 01001  2001   0.5   0.5  NA  
3 01003  2000   0.3   0.3   0.4
4 01003  2001   0.5   0.1   0.4
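
One thing to keep in mind with this approach: mean(..., na.rm = TRUE) only reproduces the original value when each group has at most one non-NA entry per column, as in the example data. A minimal base R sketch of the difference follows; the vectors x and all_missing are made up for illustration and are not taken from the question.

```r
# Hypothetical column values for one group (not from the question's data)
x <- c(0.2, NA, 0.4)

avg_value   <- mean(x, na.rm = TRUE)  # averages all non-NA values
first_value <- x[!is.na(x)][1]        # keeps only the first non-NA value

# A group with no observed value at all:
# mean() yields NaN, while indexing an empty vector yields NA
all_missing   <- c(NA_real_, NA_real_)
avg_missing   <- mean(all_missing, na.rm = TRUE)
first_missing <- all_missing[!is.na(all_missing)][1]
```

If a group could hold several different non-NA values, the two approaches would give different results, so it may be worth checking which behaviour is wanted.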

In base R you can try:

do.call(rbind, lapply(split(df, list(df$FIPS, df$Year)), function(i) 
                                                     sapply(i, function(j) j[!is.na(j)][1])))

#           FIPS Year prop1 prop2 prop3
#01001.2000    1 2000   0.7   0.3    NA
#01003.2000    2 2000   0.3   0.3   0.4
#01001.2001    1 2001   0.5   0.5    NA
#01003.2001    2 2001   0.5   0.1   0.4
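
Since the question asks what to pass as FUN to aggregate, here is a possible sketch along the same lines. It assumes that taking the first non-NA value per group is what is wanted; the helper name first_non_na is made up for this example. Unlike the output above, where FIPS appears as 1 and 2, this should keep the original FIPS values.

```r
# Example data from the question
df <- data.frame(
  FIPS  = c("01001", "01001", "01001", "01001", "01001", "01001",
            "01003", "01003", "01003", "01003", "01003", "01003"),
  Year  = c(2000, 2000, 2000, 2001, 2001, 2001,
            2000, 2000, 2000, 2001, 2001, 2001),
  prop1 = c(0.7, NA, NA, 0.5, NA, NA, 0.3, NA, NA, 0.5, NA, NA),
  prop2 = c(NA, 0.3, NA, NA, 0.5, NA, NA, 0.3, NA, NA, 0.1, NA),
  prop3 = c(NA, NA, NA, NA, NA, NA, NA, NA, 0.4, NA, NA, 0.4)
)

# Hypothetical helper: first non-NA value, or NA if the group has none
first_non_na <- function(x) x[!is.na(x)][1]

# Aggregate only the prop columns; the grouping columns go into `by`
df2 <- aggregate(df[c("prop1", "prop2", "prop3")],
                 by  = list(FIPS = df$FIPS, Year = df$Year),
                 FUN = first_non_na)

# Sort by FIPS, then Year, to match the layout of the expected data frame
df2 <- df2[order(df2$FIPS, df2$Year), ]
rownames(df2) <- NULL
```

Passing only the prop columns as the first argument avoids aggregating FIPS and Year themselves, which is one reason aggregate(df, ...) on the whole data frame is awkward here.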

You can use data.table to achieve this:

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)][1L]), by =.(FIPS,Year)]

Output:

    FIPS Year prop1 prop2 prop3
1: 01001 2000   0.7   0.3    NA
2: 01001 2001   0.5   0.5    NA
3: 01003 2000   0.3   0.3   0.4
4: 01003 2001   0.5   0.1   0.4

Note: this will be efficient if you have a large dataset.
